(2) Objective

This  project  aims  to  develop  an  adaptive  anomaly  detection  system  based  on  Isolation  Forest,  applicable  to  data 
time, but also capable of detecting abrupt changes in the underlying concepts.

(3  &  4)  Status  of  effort  &  Abstract 
anomaly detection. We have written two papers. The first paper reports the new ranking measure called mass and
investigates its relationship to two commonly used ranking measures, distance and density; it provides evidence
that the proposed mass-based approach, called Half-Space Trees, is significantly better than three existing
state-of-the-art distance-based and density-based methods in terms of detection accuracy, time complexity and memory
requirement. We also show that Isolation Forest is a mass-based approach, uncovering the previously unknown
principle underpinning the method we reported in 2008. The second paper reports how the proposed mass-based
approach can be adapted to deal with concept change, change detection and model adaptation in the context of
data streams. To our knowledge, no equivalent system in the literature has all three capabilities. We
have also identified three types of concept change and revealed that only one type requires model update,
whereas model update for the other types degrades detection accuracy. In addition, we have incorporated two
change categories into the proposed method, transient change and permanent change, in order to detect abrupt
changes in the underlying concepts.
Mass: A New Ranking Measure for
Anomaly  Detection 

Kai  Ming  Ting,  James  Tan  Swee  Chuan  and  Fei  Tony  Liu 
Gippsland  School  of  Information  Technology, 

Monash  University,  Australia. 
kaiming.ting@infotech.monash.edu.au


Abstract — Ranking  measure  is  of  prime  importance  in  anomaly  detection  tasks  because  it  is  required  to  rank  the  instances  from  the 
most  anomalous  to  the  most  normal.  This  paper  investigates  the  underlying  assumptions  and  definitions  used  for  ranking  in  existing 
anomaly  detection  methods;  and  it  has  three  aims:  First,  we  show  evidence  that  the  two  commonly  used  ranking  measures — distance 
and  density — cannot  accurately  rank  clustered  anomalies  in  anomaly  detection  tasks.  We  introduce  a  new  measure — mass,  which  can 
accurately  rank  both  scattered  and  clustered  anomalies.  Second,  we  propose  a  definition  of  anomaly  based  on  this  new  measure  and 
contrast  it  with  the  current  definitions  based  on  distance  and  density.  We  identify  the  strengths  and  weaknesses  of  these  definitions,  and 
demonstrate  the  advantages  of  the  new  definition  based  on  mass.  Third,  we  propose  a  mass-based  approach  for  anomaly  detection 
called Half-Space Tree and show that it compares favourably with three existing state-of-the-art distance-based and density-based anomaly
detection methods in terms of detection accuracy, runtime and memory space requirements.

Index  Terms — Anomaly  Detection,  Ranking  Measures,  Mass,  Scattered  Anomalies,  Clustered  Anomalies. 



1  Introduction 

Anomalies are data patterns that have different data characteristics
from normal instances. The detection of anomalies
often provides critical actionable information and has a
significant impact on the task at hand. For example, anomalies
in credit card transactions could signify fraudulent use of
credit cards [2]. An abnormal patient condition could indicate a
disease outbreak in a specific area [34]. An unusual computer
network traffic pattern could signify unauthorised access
[15]. These applications demand anomaly detectors with high
detection rates and fast execution.

Anomaly detection usually involves ranking a set of instances,
from the most anomalous instances at the top of the
ranked  list  to  the  most  normal  instances  at  the  bottom.  Since 
this  is  a  ranking  problem,  the  ranking  measure  is  of  prime 
importance  to  the  success  of  the  method.  Distance  and  density 
are  the  two  most  commonly  used  basic  ranking  measures, 
and  we  refer  to  anomaly  detection  methods  based  on  them  as 
distance-based  and  density-based  methods,  respectively,  in  this 
paper.  There  are  many  variants  of  these  two  basic  measures 
such  as  Local  Outlier  Factor  [9] — a  density  measure  based 
on  k-nearest  neighbours;  and  multi-granularity  deviation  factor 
[26] — a  density  measure  based  on  regions  with  a  fixed  radius. 

This paper investigates the underlying assumptions and definitions
for ranking anomalies which underpin many existing
anomaly detection approaches. We show that these assumptions
and definitions are only good for detecting scattered
anomalies but fail to detect clustered anomalies, a problem
known as the ‘masking effect’ [25] in the statistics literature [7, 27].
These anomalies group in cluster(s) and are difficult to detect
because their close proximity to each other gives them high
density. This is the exact opposite of the assumptions that
distance-based and density-based methods need in order to
function well: that anomalies are far from normal instances
and have low density.

It is important to incorporate the ability to detect clustered
anomalies in detectors. The inability of existing methods
to detect clustered anomalies is a ‘loophole’ that can be
exploited by fraudsters or organisms that have evolved to
evade detection. For example, credit card fraudsters might
make a short burst of abnormal transactions that mask
their abnormality through their high occurrence. This has a high
financial cost if they are not detected quickly. Similarly, viruses
might mutate to exploit this weakness, which can cost many
lives if they evade detection for a prolonged period of time.

This paper has three aims. First, we propose a new basic
ranking measure called mass for anomaly detection and
demonstrate that it has different properties from the two commonly
used ranking measures: distance and density. Second,
we propose a definition of anomaly based on this new measure
and contrast it with the current definitions based on distance
and density. We identify the strengths and weaknesses of these
definitions, and demonstrate the advantages of the new definition
based on mass. Third, we devise a mass-based approach
for anomaly detection, and we show evidence that mass is
indeed a better ranking measure than distance and density¹;
and the mass-based approach has better performance in terms
of detection accuracy, runtime and memory space requirements
than existing state-of-the-art distance-based anomaly detectors
such as ORCA [8] and one-class SVM [31], and the density-based
anomaly detector LOF [9].

The rest of the paper is organised as follows. Section 2
introduces the proposed new ranking measure: mass. Section
3 describes various definitions used for ranking anomalies and
contrasts them with the definition based on mass.
Section 4 introduces the mass-based approach to anomaly
detection. Section 5 provides an empirical evaluation, and Section
6 provides a comparison of time and space complexities
among anomaly detection methods. Section 7 describes the
connection of the proposed mass-based approach to a closely
related method. We describe the related work and conclusions
in the last two sections.

1. Note that a dichotomy into anomalies and normal points is a special case
of ranking, which is used by some anomaly detectors.


2 A new ranking measure: mass


Distance and density are the two basic ranking measures
commonly used to rank instances for anomaly detection. Distance
is usually defined based on some distance or similarity
metric; the typical distance metrics are Euclidean, Lp-norm or
Mahalanobis distance measures.

The common definition of density is the number of data
points per unit space². The other possible definition of density
is based on a kernel function, i.e., kernel density estimation, in
which a kernel function is placed at each point and the overall
density is the sum of the kernel function applied to all points.
In other words, density is defined against some standard (unit
space or kernel function).

Definition 1 : Data mass or mass, the new ranking measure
we propose, is defined as the number of points in a region;
and  two  groups  of  data  can  have  the  same  mass  regardless 
of  the  characteristics  of  the  regions  (e.g.,  density,  shape  and 
size). 

Let $X$ be a feature space and $R_i$ a subspace, $R_i \subset X$. The
mass in subspace $R_i$ is the total number of instances in that
subspace, denoted as $m(R_i)$. The basis function for mass
distribution is a rectangular function which is defined as

$$f(x) = \begin{cases} m(R_i) & \text{if } x \in R_i \\ 0 & \text{otherwise} \end{cases}$$

A mass distribution based on $f(x)$ has the following properties:

•  The  height  of  the  distribution  indicates  the  mass  of  the 
data  in  their  local  region  rather  than  denseness. 

•  Any  two  uni-modal  density  distributions  which  possess 
the  same  data  mass  will  have  the  same  height  in  their 
mass  distributions,  regardless  of  their  densities. 

All the above-mentioned properties³ are illustrated using a
simple example in Figure 1, in contrast to the density distribution
estimated using a histogram.
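
As a minimal sketch of Definition 1 (not code from the paper), the snippet below counts mass and a histogram-style density for two one-dimensional regions. The function names and the evenly spaced data are illustrative assumptions.

```python
def mass(points, low, high):
    """Mass of the region [low, high): the number of points it contains."""
    return sum(1 for p in points if low <= p < high)


def density(points, low, high):
    """Histogram-style density: points per unit length of the region."""
    return mass(points, low, high) / (high - low)


# Hypothetical data: 1000 points packed into a narrow region and
# 1000 points spread over a region eight times as wide.
narrow = [i * 0.001 for i in range(1000)]       # spans [0, 1)
wide = [10 + i * 0.008 for i in range(1000)]    # spans [10, 18)

mass_narrow = mass(narrow, 0.0, 1.0)
mass_wide = mass(wide, 10.0, 18.0)
density_narrow = density(narrow, 0.0, 1.0)
density_wide = density(wide, 10.0, 18.0)
```

Both regions receive the same mass even though their densities differ, which is the property the two bullet points describe.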

A ‘smoothed’ mass distribution can be obtained by using
an ensemble method from the rectangular basis function $f(x)$,
without explicitly defining a region.

Let $\{R_1^x, R_2^x, \ldots, R_t^x\}$ be the set of $t$ subspaces which all cover
point $x$, i.e., $R_i^x \supseteq \{x\}$; each subspace $R_i^x$ is a random
‘variant’ of the others.


Fig. 1. Mass versus density: Two regions, C1 and C2, having the
same number of instances (1000 data points each) but
different densities (sd=0.125 and 1.0, respectively) have
the same mass, shown as rectangular functions of the
same height. The histograms depict the differing densities
of the two regions.


Fig. 2. An example of three subspaces covering x.


Definition 2 : The mass distribution, $mass(x)$, estimates
mass at point $x$ by a simple averaging over all masses in all
subspaces $R_i^x$ covering $x$ as follows.

$$mass(x) = \frac{1}{t} \sum_{i=1}^{t} m(R_i^x) \qquad (1)$$

An example of $R_i^x$ is shown in Figure 2. Note that this
mass definition applies to any data distribution because no
assumption of data distribution in $R_i^x$ is made. In practice, $R_i^x$
can be generated randomly without explicitly searching for
one. We will show such a mass-based approach in Section 4.
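
The averaging in Definition 2 can be sketched in one dimension as follows. This is an illustration under stated assumptions, not the paper's procedure: each random subspace is taken to be an interval of fixed width placed at a random offset around x, and the names `smoothed_mass`, `width` and `seed` are hypothetical.

```python
import random


def smoothed_mass(points, x, t=100, width=0.5, seed=0):
    """Average the mass of t random intervals that all cover x.

    Assumption: every random subspace is an interval of the same width
    whose position around x is drawn uniformly at random.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(t):
        u = rng.random()                 # fraction of the interval left of x
        low = x - u * width
        high = x + (1.0 - u) * width
        total += sum(1 for p in points if low <= p <= high)
    return total / t


# Hypothetical data: a group of 100 points in [0, 1) plus one isolated point.
data = [i * 0.01 for i in range(100)] + [5.0]
mass_in_group = smoothed_mass(data, 0.5)
mass_isolated = smoothed_mass(data, 5.0)
```

Every interval around the isolated point contains only that point, so its averaged mass is 1, while a point inside the group receives a much larger value.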

Note that the resultant mass distribution derived from
$mass(x)$ is a ‘smoothed’ distribution, which provides a
gradation of the mass between adjacent points. Examples of
the ‘smoothed’ distribution will be shown in the next section
(in Figure 3).

$mass(x)$ can then be used to rank instances according to
their  masses — instances  with  low  mass  are  more  likely  to  be 
anomalies  than  instances  with  high  mass.  In  this  paper,  we 
show  that  the  ranking  provided  by  mass  is  good  for  detecting 
both  scattered  anomalies  and  clustered  anomalies;  but  the 
ranking  provided  by  either  density  or  distance  is  only  good 
for  scattered  anomalies.  These  will  be  discussed  in  Section 
3.2.  We  will  first  discuss  how  ranking  measures  are  used  for 
anomaly  detection  in  the  next  section. 


3  Ranking  for  anomaly  detection 
3.1 Definitions of anomalies


2. A variant is defined as the number of points per unit distance when k-nearest
neighbours are used to measure density: a ratio of k and the sum of distances
to each of the k nearest neighbours.

3.  It  is  interesting  to  note  that  there  is  only  one  way  to  compute  mass,  given 
a  region;  but  there  are  many  ways  to  compute  density  or  distance,  depending 
on  the  metric  used. 


There  are  a  number  of  accepted  definitions  for  anomalies. 
Some  prevalent  definitions  are: 

‘An observation which deviates so much from other observations
as to arouse suspicion that they were generated by a
different mechanism’ [18].

‘An observation which appears to be inconsistent with the
remainder of that set of data’ [7].

The  key  property  of  anomalies  highlighted  in  these  and 
other  definitions  is:  ‘different’,  and  the  implicit  assumption 
is  that  anomalies  are  ‘few’  with  respect  to  the  normal  points. 
We  will  do  our  analysis  in  terms  of  ‘few’  and  ‘different’  in 
the  following. 

We  focus  on  definitions  used  in  the  distance-based  and 
density-based  anomaly  detection  approaches  because  they  are 
the  prevalent  approaches.  The  definitions  commonly  used  are: 

(i)  D-Distance  Anomalies  are  data  points  which  have  fewer 
than  p  neighboring  points  within  a  distance  D  [22]. 

(ii) kth NN Distance Anomalies are the top-ranked instances
whose distance to the kth nearest neighbor is greatest
[29].

(iii)  Average  kNN  Distance  Anomalies  are  the  top-ranked 
instances  whose  average  distance  to  the  k  nearest  neighbors 
is  greatest  [4]. 

(iv)  Density-based  Anomalies  are  instances  which  are  in 
regions  of  low  density  or  low  local/relative  density. 

Definition  (i)  uses  distance  D  to  differentiate  one  region 
from  another  and  uses  the  number  of  instances  within  each 
region  to  define  ‘few’.  This  is  very  restrictive  because  the 
regions  are  constrained  to  a  specific  shape  defined  by  the 
distance  metric  employed,  and  D  is  fixed  globally.  Definitions 
(ii) and (iii) measure ‘different’ in terms of distance, and k
controls the number of instances to be considered as ‘few’:
an (anomaly) region with more than k instances will not be
considered as anomalies using these definitions. Definition (iv)
correctly refers to ‘few’ when low density regions have a small
number of instances; but this definition assigns low density
regions which have many instances as anomaly regions, and
high density regions which have few instances as normal
regions, both of which are counter-intuitive. These definitions
are good for detecting scattered anomalies (regions with one
or two instances, thus low density, which are far from normal
instances) but fail to detect clustered anomalies (regions with
few instances but high density).
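
To make the limitation of definition (ii) concrete, the following sketch scores two points by their kth nearest-neighbour distance. The one-dimensional data, the value of k and the function name are hypothetical choices for illustration; they are not taken from the paper's experiments.

```python
def kth_nn_distance(points, index, k):
    """Distance from points[index] to its k-th nearest neighbour (1-D data)."""
    x = points[index]
    distances = sorted(abs(x - p) for j, p in enumerate(points) if j != index)
    return distances[k - 1]


# Hypothetical data: a sparse normal region and a tight cluster of 20 anomalies.
normal = [i * 0.01 for i in range(1000)]             # 1000 points in [0, 10)
clustered = [50.0 + i * 0.0001 for i in range(20)]   # 20 points packed together
points = normal + clustered

k = 5
score_normal = kth_nn_distance(points, 500, k)       # a typical normal point
score_anomaly = kth_nn_distance(points, 1000, k)     # first clustered anomaly
```

Because the cluster holds more than k points, the clustered anomaly gets a smaller kth-NN distance than the normal point and would be ranked as less anomalous.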

In  contrast,  the  definition  based  on  data  mass  is 

(v)  ‘Mass-based  Anomalies  are  instances  which  are  in 
regions  of  low  mass,  regardless  of  density,  shape  and  size  of 
the  regions.’ 

‘Low  mass’  refers  specifically  to  ‘few’,  and  ‘mass’  by  its 
definition  refers  to  a  region  that  is  ‘different’  from  other 
regions. 

We  argue  that  the  definition  of  mass-based  anomalies  can 
better encapsulate the true nature of anomalies than either
distance-based  or  density-based  definitions. 

3.2  Scattered  and  clustered  anomalies 

Here  we  examine  the  ability  of  the  five  definitions  mentioned 
in  the  last  section  to  rank  scattered  anomalies  as  well  as 
clustered  anomalies.  We  first  contrast  the  density-based  and 
mass-based  definitions;  and  then  show  that  the  distance-based 
definition  has  the  same  ‘deficiency’  as  the  density-based 
definition. 

Let a region $R(m, d)$ have $m$ instances and density $d$, calculated
by a density estimation method. Assume that we have
a set of $t$ regions $R_1, R_2, R_3, \ldots, R_t$. Consider the following
four scenarios:

(a) The regions have the same $d$ but increasing masses.

(b) The regions range from dense small regions to
sparse large regions: $R_1(m, td), R_2(2m, (t-1)d), R_3(3m, (t-2)d), \ldots, R_t(tm, d)$.

(c) The regions range from sparse small regions to dense
large regions: $R_1(m, d), R_2(2m, 2d), R_3(3m, 3d), \ldots, R_t(tm, td)$.

(d) The regions have the same $m$ but increasing densities.

The series of masses and densities for each scenario is
shown in Table 1.


Scenario   Mass                  Density
(a)        m, 2m, 3m, ..., tm    d, d, d, ..., d
(b)        m, 2m, 3m, ..., tm    td, (t-1)d, (t-2)d, ..., d
(c)        m, 2m, 3m, ..., tm    d, 2d, 3d, ..., td
(d)        m, m, m, ..., m       d, 2d, 3d, ..., td

TABLE 1

Series of masses and densities of t regions in four
different scenarios: (a), (b), (c), (d).


We  highlight  the  following  observations: 

• The definition for density-based anomalies ranks all regions
equally in scenario (a), and ranks the sparse large
regions  in  front  of  the  dense  small  regions  in  scenario 
(b).  Both  of  the  above  rankings  are  counter-intuitive — 
small  regions  (because  of  being  ‘few’)  are  obviously  the 
more  likely  candidates  of  anomalies  than  large  regions. 
Scenario  (b)  is  the  case  in  which  there  are  clustered 
anomalies  which  might  be  denser  than  the  normal  regions. 

•  The  definition  for  mass-based  anomalies  ranks  small 
regions  before  the  large  regions  in  both  scenarios  (a)  and 
(b),  regardless  of  the  densities. 

• Scenario (c) is an easy case for all measures, for which the
rankings are identical and in the right order. When m = 1,
this scenario is the one which has scattered anomalies and
dense normal regions.

• In scenario (d), it is arguable whether the regions should be
ranked at all in terms of anomaly ranking: for large m,
they are all unlikely to be anomalies; for small m, say 1 or
2 instances, the density estimation is inaccurate, rendering
the ranking meaningless.
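
The observations for scenario (b) can be checked with a few lines of code. The values of t, m and d below are hypothetical; only the pattern of masses and densities follows Table 1.

```python
t, m, d = 4, 10, 1.0   # hypothetical values chosen for illustration

# Scenario (b): region R_i has mass i*m and density (t - i + 1)*d.
regions = [(f"R{i}", i * m, (t - i + 1) * d) for i in range(1, t + 1)]

# Under both definitions, a lower value means "more anomalous".
rank_by_mass = [name for name, _, _ in sorted(regions, key=lambda r: r[1])]
rank_by_density = [name for name, _, _ in sorted(regions, key=lambda r: r[2])]
```

Ranking by mass puts the small region R1 first, whereas ranking by density puts the large sparse region R4 first, which is the counter-intuitive ordering noted above.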

Figure 3 shows an example in which the anomaly definitions
based on distance and density (defined using k nearest
neighbours) fail to rank the dense small region (having 20
points, denoted as C1) ahead of the sparse large region (having
1000 points, denoted as C2), whereas the mass measure ranks
them correctly. The ‘smoothed’ mass distribution is obtained
from an ensemble of 50 mass-based models called HS*-Trees.
This mass-based ensemble approach is introduced in the
next section.

4  Mass-based  approach 

This section has three aims. First, we propose a practical
anomaly detector which provides an effective ranking using
mass. Second, we compare the three ranking measures using
the same anomaly detector and show that mass is effective for
both scattered and clustered anomalies whereas the other two
measures are not. Third, we provide empirical evidence that
the proposed mass-based approach performs better than the
state-of-the-art anomaly detectors which employ density and
distance ranking measures.

Fig. 3. A comparison of different ranking measure distributions
provided by mass, (a) density (kNN) and (b)
distance, using an example of a dense small region
C1 of 20 points (on the left) and a sparse large region C2
of 1000 points (on the right). Each ‘X’ symbol in the last
row denotes one point in one dimension (x-axis); note that
C1 appears to have only one ‘X’ because the 20 points are
very close to each other, i.e., very dense.

4.1  Half-Space  Tree 

The  motivation  of  the  proposed  method,  Half-Space  Tree, 
comes  from  the  fact  that  equal-size  subspaces  contain  the  same 
mass  in  a  space  with  uniform  mass  distribution,  regardless  of 
the  shapes  of  the  subspaces.  This  is  shown  in  Figure  4(a), 
where  the  space  enveloped  by  the  data  is  split  into  equal- 
size  half-spaces  recursively  three  times  into  eight  subdivisions. 
Note  that  the  shapes  of  the  eight  subdivisions  may  be  different 
because  the  splits  at  the  same  level  may  not  use  the  same 
attribute. 

Fig. 4. Half-space subdivisions of: (a) uniform mass
distribution; and (b) non-uniform mass distribution.


The  binary  half-space  split  ensures  that  every  split  produces 
two  equal-size  half-spaces,  each  containing  exactly  half  of 
the  mass  before  the  split  under  a  uniform  mass  distribution. 
This  characteristic  enables  us  to  compute  the  relationship 
between  any  subdivisions  easily.  For  example,  the  mass  in 
every subdivision shown in Figure 4(a) is the same, and it is
equivalent to the original mass divided by $2^3$ because three
levels of binary half-space subdivisions have been applied.
A  deviation  from  the  uniform  mass  distribution  allows  us  to 
rank  the  subdivisions  based  on  mass.  Figure  4(b)  provides 
such  an  example  in  which  a  ranking  of  subdivisions  based 
on  mass  provides  an  order  of  the  degrees  of  anomaly  in  each 
subdivision. 

In  the  following  two  subsections,  we  first  provide  definitions 
for  the  proposed  Half-Space  Tree  method  and  its  two  different 
variants;  then,  we  present  the  Half-Space  Tree  algorithm. 


4.2  Definitions 

Definition  3  :  Half-Space  Tree  is  a  binary  tree  in  which  each 
internal  node  makes  a  half-space  split  into  two  equal-space 
subdivisions,  and  each  external  node  terminates  further  splits. 
All  nodes  record  the  mass  of  the  training  data  in  their  own 
subdivisions. 

Let $R[i]$ be a half-space subdivision at depth level $i$ with
mass $m(R[i])$, or $m[i]$ for short.

Definition 4 : Equivalence of mass between any two subdivisions
is expressed with reference to $m[i=0]$ at depth level 0
(the root) of a Half-Space Tree.

Fig. 5. Half-Space Tree: (a) HS-Tree: An HS-Tree for the
data shown in Figure 4a has $m_i = 4, \forall i$, which are $m[i=3]$
(i.e., mass at level 3). (b) HS*-Tree: This is an example
of a special case of HS*-Tree when the size limit is set to
1.

Under uniform mass distribution, the mass at level $i$ is
related to the mass at level 0 as follows:

$$m[i=0] = m[i] \times 2^{i},$$

or masses between any subdivisions at levels $i$ and $j$ are related
as follows:

$$m[i] \times 2^{i} = m[j] \times 2^{j}.$$

Under non-uniform mass distribution, the following inequality
establishes an ordering between subdivisions at different
levels:

$$m[i] \times 2^{i} < m[j] \times 2^{j}.$$

We employ the above property to rank instances and define
the anomaly score $s$ based on (augmented) mass for Half-Space
Tree as follows.

$$s(x) = m[\ell] \times 2^{\ell}, \qquad (2)$$

where $\ell$ is the depth level of an external node with $m[\ell]$
instances into which a test instance $x$ falls.

The anomaly score is based on mass $m[\ell]$ only if a Half-Space
Tree has all external nodes at the same depth level.
The score is based on augmented mass, $m[\ell] \times 2^{\ell}$, if the
external nodes have differing depth levels. We describe two
such variants of Half-Space Tree below.
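
A small numeric sketch of Equation (2) follows. The root mass of 32 and the leaf masses are hypothetical numbers chosen so that the uniform case is easy to verify by hand.

```python
def augmented_mass(mass, level):
    """Anomaly score of Equation (2): the node's mass scaled by 2**level."""
    return mass * 2 ** level


# Hypothetical tree with 32 training points at the root (level 0).
root = augmented_mass(32, 0)
uniform_leaf = augmented_mass(4, 3)   # 32 / 2**3 points, as in the uniform case
shallow_leaf = augmented_mass(8, 2)   # 32 / 2**2 points, also the uniform case
sparse_leaf = augmented_mass(1, 3)    # fewer points than the uniform case
```

Leaves that hold their uniform share of the data score the same as the root regardless of depth, while the sparse leaf scores lower and is therefore ranked as more anomalous.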

HS-Tree: Ranking based on mass only. The first variant,
HS-Tree, builds a balanced binary tree structure which makes
a half-space split at each internal node, and all external nodes
have the same depth. The number of training instances falling
into each external node is recorded and used for scoring
in testing. An example of HS-Tree is shown in Figure 5(a).

HS*-Tree: Ranking based on augmented mass. Unlike
HS-Tree, the second variant, HS*-Tree, has external nodes at
differing depth levels. The anomaly score for HS*-Tree is given
by Equation (2) in order to account for the different depths. We
call this score augmented mass, as the mass is augmented in the
calculation by the level of subdivision in HS*-Tree, as opposed
to mass only in HS-Tree.

In a special case of HS*-Tree, an external node only
terminates when the training data size is 1. Here the anomaly
score reduces to depth level only, i.e., $2^{\ell}$ or simply $\ell$. In other
words, the depth level becomes a proxy for mass in HS*-Tree
when the size limit is set to 1. An example of HS*-Tree,
when the size limit is set to 1, is shown in Figure 5(b).

Since  the  two  variants  have  similar  performance,  we  will 
focus  on  HS*-Tree  only  in  the  rest  of  this  paper;  and  use  the 
terms ‘mass’ and ‘augmented mass’ interchangeably, unless a
distinction  is  required. 

Ensemble.  The  proposed  method  uses  a  random  subsample 
to  build  one  Half-Space  Tree,  and  multiple  Half-Space  Trees 
are  constructed  from  different  random  subsamples  (without 
replacement)  to  form  an  ensemble. 

4.3  Algorithm  for  HS*-Trees 

The  proposed  method  estimates  a  mass  distribution  efficiently, 
even  in  a  multi-dimensional  space,  without  density  or  distance 
calculations  or  clustering.  The  method  employs  an  ensemble  to 
produce  a  ‘smoothed’  mass  distribution  which  is  an  average  of 
augmented  mass  from  all  trees  in  the  ensemble,  where  every 
mass estimate for an instance x from a tree assumes a
rectangular  function  over  the  region  in  which  x  resides.  This 
method  requires  only  a  small  data  sample  to  produce  a  mass 
distribution  that  is  suitable  for  anomaly  detection. 

Training.  The  first  step  in  generating  an  HS*-Tree  from  a 
data  sample  is  to  establish  a  working  space.  An  internal  node 
in  HS*-Tree  is  created  by  randomly  selecting  a  dimension, 
and  a  half-space  split  on  this  dimension  is  established  for  this 
node.  Then,  the  process  is  repeated  for  each  branch  until  a  size 
limit  (or  a  depth  limit)  is  reached  to  form  an  external  node. 
The training instances at the external node at depth level $\ell$
form the mass $m[\ell]$ to be used during the testing process. The
training  procedures  for  an  ensemble  of  HS*-Trees  are  shown 
in  Algorithms  1  and  2. 


Algorithm 1 : HS*-Trees(X, t, ψ, S, h)

Inputs: X - input data, t - number of trees, ψ - sub-sampling
size, S - data size limit at external node, h - maximum depth
limit

Output: F - a set of t HS*-Trees

SizeLimit ← S
MaxDepthLimit ← h
Initialize F
for i = 1 to t do
    X' ← sample(X, ψ) {without replacement}
    (min, max) ← InitialiseWorkingSpace(X')
    F ← F ∪ SingleHS*-Tree(X', min, max, 0)
end for
return F


The aim is to generate many diverse HS*-Trees to form
an ensemble. This is achieved by defining a (random) range
for each dimension, forming a working space which covers
all the training data of a subsample, before the construction
of a tree. This is done in the InitialiseWorkingSpace step
in Algorithm 1. For each attribute $q$, a random split value
($z_q$) is chosen within the range $[Dmin_q, Dmax_q]$, i.e., the
minimum and maximum values of $q$ in the subsample. Then,
attribute $q$ of the working space is defined as having the range
$[min_q, max_q] = [z_q - r, z_q + r]$, where $r = 2 \cdot \max(z_q - Dmin_q, Dmax_q - z_q)$.
The ranges for all dimensions define
the working space used to generate a Half-Space Tree. The outer
rectangles in both Figures 4a and 4b are examples of a working
space.

Algorithm 2 : SingleHS*-Tree(X, min, max, ℓ)

Inputs: X - input data, min & max - arrays of minimum
and maximum values for all attributes in a working space,
ℓ - current depth level

Output: an HS*-Tree

if (|X| ≤ SizeLimit) or (ℓ ≥ MaxDepthLimit) then
    return exNode(Size ← |X|)
else
    randomly select an attribute q
    p ← (max_q + min_q)/2
    X_l ← filter(X, q < p)
    X_r ← filter(X, q ≥ p)
    {Build two nodes, Left and Right, as a result of a split
     into two equal-volume half-spaces.}
    temp ← max_q; max_q ← p
    Left ← SingleHS*-Tree(X_l, min, max, ℓ+1)
    max_q ← temp; min_q ← p
    Right ← SingleHS*-Tree(X_r, min, max, ℓ+1)
    return inNode(Left, Right, SplitAtt ← q, SplitValue ← p)
end if
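
The training procedure of Algorithms 1 and 2, including the working-space initialisation, might be sketched in Python as below. This is an illustrative reading of the pseudocode rather than the authors' implementation: the dictionary-based node layout, the function names and the synthetic Gaussian data are assumptions, and the bounds are copied at each split instead of being modified in place.

```python
import random


def initialise_working_space(sample, rng):
    """Return (mins, maxs): a random range per attribute that covers the sample."""
    mins, maxs = [], []
    for q in range(len(sample[0])):
        d_min = min(p[q] for p in sample)
        d_max = max(p[q] for p in sample)
        z = rng.uniform(d_min, d_max)              # random split value z_q
        r = 2 * max(z - d_min, d_max - z)
        mins.append(z - r)
        maxs.append(z + r)
    return mins, maxs


def single_tree(points, mins, maxs, level, size_limit, max_depth, rng):
    """Recursively build one tree as nested dicts, following Algorithm 2."""
    if len(points) <= size_limit or level >= max_depth:
        return {"size": len(points), "level": level}     # external node
    q = rng.randrange(len(mins))                   # randomly selected attribute
    p = (mins[q] + maxs[q]) / 2                    # mid-point of the current range
    left_points = [x for x in points if x[q] < p]
    right_points = [x for x in points if x[q] >= p]
    left_maxs = maxs[:q] + [p] + maxs[q + 1:]      # left half-space: upper bound p
    right_mins = mins[:q] + [p] + mins[q + 1:]     # right half-space: lower bound p
    return {
        "split_att": q,
        "split_value": p,
        "left": single_tree(left_points, mins, left_maxs,
                            level + 1, size_limit, max_depth, rng),
        "right": single_tree(right_points, right_mins, maxs,
                             level + 1, size_limit, max_depth, rng),
    }


def hs_star_trees(data, t, psi, size_limit, max_depth, seed=0):
    """Build t trees, each from psi points sampled without replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(t):
        sample = rng.sample(data, psi)
        mins, maxs = initialise_working_space(sample, rng)
        forest.append(single_tree(sample, mins, maxs, 0,
                                  size_limit, max_depth, rng))
    return forest


def leaf_sizes(node):
    """Collect the mass recorded at every external node of a tree."""
    if "size" in node:
        return [node["size"]]
    return leaf_sizes(node["left"]) + leaf_sizes(node["right"])


# Hypothetical two-dimensional data set of 500 Gaussian points.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(500)]
forest = hs_star_trees(data, t=5, psi=64, size_limit=4, max_depth=10)
```

Since every split sends each point to exactly one side, the masses recorded at the external nodes of a tree add up to the subsample size, which gives a quick sanity check on the construction.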

Constructing a single Half-Space Tree is almost identical
to constructing an ordinary decision tree⁴ [28], except that no
splitting selection criterion is required at each node. The split
point of a node in Half-Space Tree is simply the mid-point in
a randomly selected dimension of the working space defined
above. The detailed procedure is described in Algorithm 2.

Note that the entire training procedure does not require
any evaluation criteria at all; and randomisation is invoked
in three steps: the random subsample step, the working space
initialisation step, and the random attribute selection at each
internal node.

Testing. During testing, a test instance $x$ traverses
each Half-Space Tree from the root to an external node, and
the (augmented) mass recorded at the external node is used as the anomaly
score $s(x)$ (i.e., Equation (2)) for this instance. This testing is
carried out for all Half-Space Trees in the ensemble, and the
final score is the average score from all trees, equivalent to
Equation (1).
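
The testing step can be illustrated with two tiny hand-built trees. The node layout (dictionaries with `split_att`, `split_value`, `size` and `level` keys) and the masses are hypothetical and serve only to show the traversal and the averaging.

```python
def tree_score(node, x):
    """Traverse one tree and return the augmented mass of the leaf reached."""
    while "size" not in node:                    # internal node: keep descending
        if x[node["split_att"]] < node["split_value"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["size"] * 2 ** node["level"]


def ensemble_score(forest, x):
    """Average the per-tree scores; lower values suggest a more anomalous x."""
    return sum(tree_score(tree, x) for tree in forest) / len(forest)


# Two tiny hand-built trees over a single attribute (hypothetical masses).
tree_a = {
    "split_att": 0, "split_value": 0.5,
    "left": {"size": 6, "level": 1},
    "right": {
        "split_att": 0, "split_value": 0.75,
        "left": {"size": 3, "level": 2},
        "right": {"size": 1, "level": 2},
    },
}
tree_b = {
    "split_att": 0, "split_value": 0.6,
    "left": {"size": 7, "level": 1},
    "right": {"size": 2, "level": 1},
}
forest = [tree_a, tree_b]

score_dense = ensemble_score(forest, (0.2,))    # (6*2 + 7*2) / 2
score_sparse = ensemble_score(forest, (0.9,))   # (1*4 + 2*2) / 2
```

The instance that falls into low-mass leaves receives the lower averaged score and would be placed nearer the top of the anomaly ranking.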

For a data set, the scores obtained are used to rank the
instances. From this ranking and the ground truth, we employ
AUC (Area Under the receiver operating characteristic Curve) to
measure the performance of all anomaly detectors reported in
this paper.

4. However, they are for different tasks: decision trees are for supervised
learning tasks; Half-Space Trees are for unsupervised learning tasks.
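
As an illustration of how AUC follows from a ranking and the ground truth, the sketch below uses the pairwise (rank-statistic) formulation: AUC is the fraction of (anomaly, normal) pairs that the scores order correctly, with ties counted as half. The function name is hypothetical, and the convention that a lower score means more anomalous is an assumption suited to mass-style scores.

```python
def auc_low_score_is_anomalous(scores, is_anomaly):
    """AUC for a detector where LOWER scores mean MORE anomalous.

    scores: list of numbers; is_anomaly: list of booleans (ground truth).
    Quadratic in the number of instances, so only suitable for small inputs.
    """
    anomalies = [s for s, a in zip(scores, is_anomaly) if a]
    normals = [s for s, a in zip(scores, is_anomaly) if not a]
    if not anomalies or not normals:
        raise ValueError("need at least one anomaly and one normal instance")
    correct = 0.0
    for s_anom in anomalies:
        for s_norm in normals:
            if s_anom < s_norm:
                correct += 1.0   # anomaly correctly ranked ahead
            elif s_anom == s_norm:
                correct += 0.5   # tie counts as half
    return correct / (len(anomalies) * len(normals))
```

A perfect ranking gives 1, a fully inverted ranking gives 0, and a ranking that cannot separate the two groups gives 0.5.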

Time and space complexities. Because no evaluations
or searches are required, an HS*-Tree can be generated very
quickly. In addition, a well-performing HS*-Tree can be generated
using only a small subsample (size $\psi$) from a given data set of
size $n$, where $\psi \ll n$. An ensemble of HS*-Trees has training
time complexity $O(th\psi)$, which is constant for an ensemble with
fixed subsample size $\psi$, maximum depth level $h$ and ensemble
size $t$. It has time complexity $O(thn)$ during testing. The space
complexity for HS*-Trees is $O(th\psi)$, which is also constant for
an ensemble with fixed subsample size, maximum depth level
and ensemble size.

5  Empirical  evaluation 

Here we conduct experiments to evaluate the proposed ranking
measure, mass, and the proposed mass-based method, HS*-Trees.
This section has two aims. First, we compare the
proposed ranking measure, mass, with two commonly used
measures, density and distance, all implemented in HS*-Trees.
This assesses the anomaly detection performance of the three
measures using the same algorithm. Second, we assess the
anomaly detection performance of the mass-based method HS*-Trees
in comparison with three state-of-the-art algorithms
which use density and distance ranking measures.

Experimental settings. The default settings for HS*-Trees
are $\psi = 256$, S = 20, h = 20 and t = 100 (i.e., an
ensemble of 100 trees is built). The performance measures are
AUC (Area Under the receiver operating characteristic (ROC)
Curve) and CPU run time. Although AUC or the ROC curve
(or the alternative precision-recall curve) is commonly used
for supervised learning tasks, it is also commonly used for
measuring performance in anomaly detection, e.g., in [13, 16,
21]. AUC ranges from 0 (the worst) to 1 (the best). The results
reported are averages over ten runs; each run uses a different
random seed for all non-deterministic algorithms. The
experiments are conducted as single-threaded jobs processed
at 2.3GHz in a Linux cluster (www.vpac.org).

When using kth NN distance or kNN density in HS*-Trees
in the first experiment, a test instance traverses from the root
of a tree to the deepest node which has at least k data points
so that a value can be computed using this measure. We use
k = 5 in our experiments.

The second experiment compares HS*-Trees with ORCA
[8], one-class SVM (first mentioned in [31]) and LOF [9].
ORCA employs distance-based definition (ii), stated in
Section 3.1, to rank anomalies; LOF is the state-of-the-art
density-based anomaly detector, designed to detect both local and
global anomalies: it computes local density and defines anomalies
as having low local densities (i.e., density-based definition
(iv)); and SVM employs a simplified version of the distance
measure.

ORCA is a distance-based method based on k-Nearest
Neighbour (kNN) with a sample randomisation scheme and
a pruning rule to speed up run time. Our default parameters
for ORCA are k = 5 and N = n/8, where N is the number of
anomalies expected. LOF is a density-based method based on
k-nearest neighbour. LOF's default parameter is k = 10.
One-class SVM uses the Radial Basis Function kernel, and its
inverse width parameter is estimated by the method suggested
in [10].

Fig. 6. This figure shows an example of Mulcross in two
dimensions. Circles (o) are normal points in the middle
cluster, and triangles (△) denote anomalies, which are
points in two smaller clusters and at the fringe of the
middle cluster.

Data set            data size   d     anomaly class
Http (KDDCUP99)     567497      3     attack (0.4%)
ForestCover         286048      10    class 4 (0.9%) vs. class 2
Mulcross            262144      4     2 clusters (10%)
Smtp (KDDCUP99)     95156       3     attack (0.03%)
Shuttle             49097       9     classes 2, 3, 5, 6, 7 (7%)
Mammography         11183       6     class 1 (2%)
Annthyroid          7200        6     classes 1, 2 (7%)
Satellite           6435        36    3 smallest classes (32%)
Pima                768         8     pos (35%)
Breastw             683         9     malignant (35%)
Arrhythmia          452         274   classes 03, 04, 05, 07, 08, 09, 14, 15 (15%)
Ionosphere          351         32    bad (36%)
Synthetic           2000        2     anomaly (7.5%)

TABLE 2

Data characteristics of the data sets used in the
experiments, where d is the number of dimensions, and
the percentage in brackets indicates the percentage of
anomalies.


Benchmark data sets are employed in both experiments. We
utilise a synthetic data set and an additional twelve data sets,
mostly from the UCI repository [5], which include many
real-world data sets, e.g., two from KDD CUP 99, one
Mammography data set⁵ and one anomaly data generator, Mulcross
[30], which generates data with both clustered and scattered
anomalies. An example of Mulcross is shown in Figure 6.

5. The Mammography data set was made available courtesy of Aleksandar
Lazarevic.


Table 2 gives a summary of these data sets. All nominal and
binary attributes are removed to focus on the
continuous-valued attributes.

5.1  Mass  versus  density  and  distance 

To better understand the robustness of ranking using mass, we
first examine in this section how well the ranking based on
either mass, kth NN distance or kNN density performs in a
scenario in which the density of an anomaly cluster changes with
respect to the normal cluster. To achieve this, we use a synthetic
data set with two clusters, C1 and C2. The normal cluster is
denoted as C1: it is a bivariate normal distribution with a standard
deviation of 2 and has 1850 instances. The anomaly cluster is
denoted as C2: it has 150 instances and is well separated from
the normal cluster. The standard deviation of C2 decreases
from 2 to 0.2, with a step size of 0.2, yielding a ratio of
standard deviations between C1 and C2 (denoted as Stdev
Ratio) that changes from 1 to 10, simulating an increasingly
dense anomaly cluster. Figure 7(a) shows two examples where
Stdev Ratio is equal to 1 and 10.
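
A data set of this shape could be generated along the following lines. This is a sketch under assumptions not stated in the text: the cluster centres (`c1_centre`, `c2_centre`), the use of independent Gaussian components in each dimension, and the function name `make_synthetic` are all illustrative choices.

```python
import random

def make_synthetic(c2_std, rng, c1_centre=(0.0, 0.0), c2_centre=(15.0, 15.0)):
    """Return (points, labels) with 1850 normal and 150 anomalous points.

    c2_std: standard deviation of the anomaly cluster C2.
    The centres are hypothetical; the text only says C2 is well separated.
    """
    c1 = [(rng.gauss(c1_centre[0], 2.0), rng.gauss(c1_centre[1], 2.0))
          for _ in range(1850)]
    c2 = [(rng.gauss(c2_centre[0], c2_std), rng.gauss(c2_centre[1], c2_std))
          for _ in range(150)]
    points = c1 + c2
    labels = [0] * 1850 + [1] * 150   # 1 marks an anomaly
    return points, labels

# Standard deviations of C2: 2.0, 1.8, ..., 0.2 (step 0.2).
c2_stds = [round(2.0 - 0.2 * i, 1) for i in range(10)]
# Stdev Ratio = std(C1) / std(C2), running from 1 to 10.
stdev_ratios = [2.0 / s for s in c2_stds]
```

With 150 anomalies out of 2000 instances, the anomaly proportion is 7.5%, matching the Synthetic row of Table 2.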


Fig. 7. (a) Two examples of the synthetic data used, at
Stdev Ratio = 1 and Stdev Ratio = 10. (b) A comparison of AUC
performance of HS*-Trees using either mass, kth NN distance or
kNN density.

Figure 7(b) shows that when the standard deviation ratio
changes from 1 to 10, the AUC score for HS*-Trees using
either kth NN distance or kNN density drops significantly.
When the ratio goes beyond 3, kNN density gives an AUC
score below 0.5. Note that an AUC of 0.5 is the expected
score by random ranking. As the ratio increases, both kNN
density and kth NN distance begin to rank normal instances
ahead of anomalies, exactly the scenario depicted in
(c) in Section 3. In contrast, HS*-Trees using mass has an
almost perfect AUC score throughout the entire range.

                      AUC                        Time (seconds)
               mass   distance  density   mass    distance  density
Http           1.00   1.00      0.62      88.35   1849.21   568.33
ForestCover    0.89   0.65      0.67      36.33   156.20    131.48
Mulcross       0.99   #0.00     #0.00     29.22   182.71    109.18
Smtp           0.90   0.85      0.87      17.07   113.59    62.88
Shuttle        1.00   0.98      0.88      9.75    74.59     34.29
Mammography    0.86   0.82      0.83      3.41    11.51     7.00
Annthyroid     0.73   0.73      0.73      2.61    5.64      3.68
Satellite      0.74   0.74      0.72      2.54    6.37      4.94
Pima           0.69   0.72      0.72      1.44    1.68      1.30
Breastw        0.99   0.98      0.98      1.51    1.78      1.26
Arrhythmia     0.84   0.82      0.82      1.92    3.78      3.30
Ionosphere     0.80   0.90      0.90      1.36    1.59      1.33
win/draw/loss         7/3/2     9/1/2             12/0/0    9/0/3

TABLE 3

Results comparing three ranking measures in terms of AUC and runtime. The distance and density are computed
based on k nearest neighbours at the leaf of HS*-Trees. Figures boldfaced are the best performance for each data
set. The runtime results include both training and testing times. # The AUC results are not exactly zero but a small
number less than one-hundredth.


We also conduct an experiment with k-Means using the
three ranking measures, and it produces the same result (shown
in Appendix A). This shows that the mass ranking
measure is better than either the distance or density measure,
independent of the specific method used (HS*-Trees or
k-Means).

It is important to note that while kth NN distance is able
to detect clustered anomalies when a cluster has fewer than k
members, trying to find an appropriate k is impractical for two
reasons. First, there may be more than one cluster with differing
numbers of anomalies. Second, the number of anomalies in one
cluster can vary from one occasion to another, e.g., from one
sample to another. Thus, setting a fixed k is not a solution. In
addition, setting a large k increases the runtime substantially.

Table 3 shows the anomaly detection performance in the
twelve data sets (listed in Table 2) in terms of AUC and
runtime for HS*-Trees using each of the three measures
to perform ranking. In terms of AUC, mass has more wins than
losses against both distance and density, notably in ForestCover
and Mulcross, in which anomaly clusters have a significant
presence. Mass loses only in the Pima and Ionosphere data sets,
and the difference is small. These losses are likely to be due
to scenario (d) mentioned in Section 3.2, in which both density
and distance have a slight advantage. In terms of runtime,
HS*-Trees using mass has a significant advantage over all other
ranking measures, especially in large data sets. For example,
in the largest data set, Http, HS*-Trees using mass takes less
than one-twentieth and one-sixth of the time required by
HS*-Trees using distance and density, respectively.
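
The win/draw/loss counts can be recomputed from the AUC columns of Table 3, as in the sketch below. The helper name `win_draw_loss` is hypothetical, the comparison is made on the rounded values as printed (so it may differ from a comparison on unrounded results), and the two Mulcross entries marked # are entered as 0.00.

```python
def win_draw_loss(ours, theirs, higher_is_better=True):
    """Count data sets where `ours` wins, draws or loses against `theirs`."""
    wins = draws = losses = 0
    for a, b in zip(ours, theirs):
        if a == b:
            draws += 1
        elif (a > b) == higher_is_better:
            wins += 1
        else:
            losses += 1
    return wins, draws, losses

# AUC columns of Table 3, in the row order Http ... Ionosphere.
mass_auc     = [1.00, 0.89, 0.99, 0.90, 1.00, 0.86, 0.73, 0.74, 0.69, 0.99, 0.84, 0.80]
distance_auc = [1.00, 0.65, 0.00, 0.85, 0.98, 0.82, 0.73, 0.74, 0.72, 0.98, 0.82, 0.90]
density_auc  = [0.62, 0.67, 0.00, 0.87, 0.88, 0.83, 0.73, 0.72, 0.72, 0.98, 0.82, 0.90]

mass_vs_distance = win_draw_loss(mass_auc, distance_auc)
mass_vs_density = win_draw_loss(mass_auc, density_auc)
```

For runtime columns, the same helper can be called with `higher_is_better=False`.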

5.2 Comparison with state-of-the-art anomaly detectors

This experiment aims to show that mass-based HS*-Trees
performs better, in terms of AUC and run time, than
distance-based or density-based methods: ORCA, SVM and
LOF.

Fig. 8. A comparison of HS*-Trees, LOF, ORCA and SVM
in the synthetic data.


Figure 8 shows the result using the synthetic data (shown
in Figure 7(a)). It shows that HS*-Trees has near-perfect
AUC over the entire range of Stdev Ratios. In contrast,
methods based on distance and density all perform very poorly
in detecting clustered anomalies in this data set, especially
when clustered anomalies become significantly denser than the
normal cluster. Note that LOF performs particularly poorly in
this data set because it can only detect instances at the fringes
of the two clusters as anomalies, and that does not change
much throughout the entire range of Stdev Ratios.

Table 4 shows the anomaly detection performance in the
twelve data sets in terms of AUC and runtime, comparing
HS*-Trees with the three methods: ORCA, SVM and LOF.
In terms of AUC, HS*-Trees has better detection accuracy
than all of the other methods, with the following win/draw/loss
counts: 10/0/2 compared to ORCA, 11/0/1 compared to SVM, and
10/0/1 compared to LOF. In terms of runtime, HS*-Trees is
significantly faster than all other methods, especially in the
large data sets. For example, in the largest data set, Http,
HS*-Trees takes less than one-hundredth and one-four-hundredth
of the time required by ORCA and SVM, respectively; and
the gap is even bigger in comparison with LOF: less than
one-thirteen-thousandth!

                          AUC                              Time (seconds)
               HS*-Trees  ORCA    SVM     LOF     HS*-Trees  ORCA      SVM        LOF
Http           1.00       0.36    0.90    †       88.35      9487.47   35872.09   > 2 weeks
ForestCover    0.89       0.83    0.90    0.57    36.33      6995.17   9737.81    224380.19
Mulcross       0.99       0.33    0.59    0.59    29.22      2512.20   7342.54    156044.13
Smtp           0.90       0.87    0.78    0.32    17.07      267.45    986.84     24280.65
Shuttle        1.00       0.60    0.79    0.55    9.75       156.66    332.09     7489.74
Mammography    0.86       0.77    0.65    0.67    3.41       4.49      10.80      14647.00
Annthyroid     0.73       0.68    0.63    0.72    2.61       2.32      4.18       72.02
Satellite      0.74       0.65    0.61    0.52    2.54       8.51      8.97       217.39
Pima           0.69       0.71    0.55    0.49    1.44       0.06      0.06       1.14
Breastw        0.99       0.98    0.66    0.37    1.51       0.04      0.07       1.77
Arrhythmia     0.84       0.78    0.71    0.73    1.92       0.49      0.15       6.35
Ionosphere     0.80       0.92    0.71    0.89    1.36       0.04      0.04       0.64
win/draw/loss             10/0/2  11/0/1  10/0/1             7/0/5     8/0/4      9/0/2

TABLE 4

Results comparing four anomaly detectors in terms of AUC and runtime. Figures boldfaced are the best performance
for each data set. † We do not have the full results for LOF because it has a high computational complexity and is
unable to complete the large data set in more than two weeks.

There are two interesting observations. First, HS*-Trees
using density (result shown in Table 3) performs better than
LOF, a state-of-the-art algorithm using the density measure, in
almost all data sets in terms of both AUC and runtime. The only
exceptions are the small Pima and Ionosphere data
sets in terms of runtime, and Mulcross in terms of AUC.
Second, HS*-Trees using distance performs better in terms
of AUC than ORCA (which employs the distance measure) in 7
data sets, draws in 1, and loses in 4 data sets, which is very
competitive. In terms of runtime, HS*-Trees using distance is
significantly faster in all large data sets: it is more than five
times faster than ORCA in the largest data set, Http.

Figure 9 shows the runtime comparison between HS*-Trees
and ORCA (which is a near-linear algorithm and the fastest
among the three anomaly detectors) using the Mulcross data
generator.

In this experiment, training an ensemble of 100 HS*-Trees
takes constant time at about 1 second, regardless of whether the
given data size is one thousand or one million. This is because
only a subsample of fixed size $\psi = 256$ is used to train an
HS*-Tree. The sublinear increase in runtime for HS*-Trees is
solely due to the time used for testing, which is sublinear in the
size of the data set. In this case, the testing time increases from
0.6 to 111 seconds when the data size increases from one
thousand to one million. In contrast, ORCA's runtime increases
from 0.1 seconds to over 40000 seconds. Moreover, the AUC
performance of HS*-Trees has already reached 0.97 at a data
size of one thousand, whereas ORCA's AUC starts at 0.02 and
stabilises at 0.33, its top AUC performance, even when the data
size increases to one million.

Fig. 9. Run time (training + testing) comparison between
HS*-Trees and ORCA in the Mulcross data set with increasing
data sizes.



6 Time and space complexities comparison

This section compares the time and space complexities of
three state-of-the-art anomaly detectors with HS*-Trees. This
includes DOLPHIN [3], the latest k-nearest-neighbour-based
method which uses distance for ranking. Table 5 lists
the complexities for the four algorithms.


             Time complexity         Space complexity
HS*-Trees    $O(th(n + \psi))$       $O(th\psi)$
DOLPHIN      $O(n^2 d)$              $O(n)$
             $O(\frac{k}{p} n d)$ †  $O(\frac{k}{p})$ †
ORCA         $O(n \log n \cdot d)$   $O(n)$
LOF          $O(n^2 d)$              $O(n)$

† Under special condition; p is the probability of randomly picking a point
from the data set which is a neighbour of the point under consideration
using a search index; k is the number of nearest neighbours; d is the
number of dimensions.

TABLE 5

A comparison of time and space complexities. The time
complexity includes both training and testing.


HS*-Trees has a significant advantage over the three
k-nearest-neighbour-based methods in terms of both time and space
complexities. This is mainly due to the fact that HS*-Trees
only needs a small subsample to train a tree, where $\psi \ll n$;
$\psi$ (also $t$ and $h$) can be fixed in practice, regardless of the
size of the given training set, as demonstrated in the last section
using Mulcross. Another distinguishing feature is that the time
complexity of HS*-Trees is independent of the dimensionality
of the domain. Thus, the training time and memory space
requirement are fixed. These properties make HS*-Trees an
ideal candidate for domains with huge data sizes or
infinite data, such as data streams.
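
As a rough worked example, the default settings from Section 5 ($t = 100$, $h = 20$, $\psi = 256$) can be substituted into the complexity terms. The figures below ignore the constant factors hidden by the $O(\cdot)$ notation, so they indicate growth only, not actual operation counts.

```latex
\begin{align*}
\text{training: } & t\,h\,\psi = 100 \times 20 \times 256 = 512{,}000
   && \text{(independent of } n\text{)} \\
\text{testing (Http, } n = 567{,}497\text{): } & t\,h\,n = 100 \times 20 \times 567{,}497
   = 1{,}134{,}994{,}000 \\
\text{testing (Ionosphere, } n = 351\text{): } & t\,h\,n = 100 \times 20 \times 351 = 702{,}000
\end{align*}
```

The training term stays the same for every data set, while the testing term grows linearly with $n$; neither term involves the number of dimensions $d$.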

7  Relation  to  iForest 

Here  we  establish  that  the  anomaly  score  used  in  iForest  [20], 
i.e.,  path  length,  is  a  form  of  the  augmented  mass  in  Half- 
Space  Tree. 

The  definition  of  anomalies  based  on  iForest  is  given  as 
follows. 

‘Anomalies  are  the  top-ranked  instances  whose  average  path 
lengths  are  the  shortest.’ 

During testing, a tree in iForest computes the path length
at an external node with $m$ instances at level $\ell$ as follows.

$$s = \ell + c(m) = \ell + 2(\ln(m-1) + E) - \frac{2(m-1)}{m},$$

where $c(m)$ is a function which estimates the average path
length of an unexpanded subtree for training data of size $m$,
and $E$ is Euler's constant [20].

Applying a logarithm to Equation (2) used in Half-Space Tree
gives

$$s' = \ell + \log(m)$$

Note that both $s$ and $s'$ take the general form of depth level
$\ell$ plus a derivative of mass. In essence, the path length used
in iForest [20] is a form of augmented mass ranking measure.
Thus, iForest is a kind of mass-based approach.
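
The two scores can be compared numerically with the sketch below. The definition of c(m) follows iForest [20]; treating c(m) as 0 for m ≤ 1, using base 2 for the logarithm, and the function names themselves are assumptions made for this illustration.

```python
import math

EULER = 0.5772156649  # Euler's constant, the E in the path length formula

def c(m):
    """Estimated average path length of an unexpanded subtree of size m."""
    if m <= 1:
        return 0.0  # assumed convention: nothing left to expand
    return 2.0 * (math.log(m - 1) + EULER) - 2.0 * (m - 1) / m

def iforest_path_length(level, m):
    """s = level + c(m), the iForest score at an external node."""
    return level + c(m)

def log_augmented_mass(level, m):
    """s' = level + log(m); base 2 is an assumption of this sketch."""
    return level + math.log2(m)
```

For a fixed level, both functions increase with the mass m at the external node, which is the sense in which each is the depth level plus a derivative of mass.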

A key difference between Half-Space Tree and iForest
[20] is that Half-Space Tree uses the mid-point split, which
guarantees equal-size subdivision, whereas iForest randomly
selects a split point. The analysis in Section 4.1 is possible
because of half-space splits; as far as we know, no
equivalent analysis exists for random splits.

A comparison of HS*-Trees and iForest in terms of AUC
is given in Appendix B.

8  Related  work 

On the surface, there is a close relationship between the
augmented mass in Equation (2) and data depth [19] because
of the use of depth level in the form of Half-Space Tree: they
both delineate the centrality of a data cloud (as opposed to
compactness in the case of the density measure). However, there
are two fundamental differences. First, mass, by its definition,
does not have to be expressed in tree form. For example, using
a clustering algorithm to find regions, and then applying mass to
rank the regions (as shown in Appendix A), does not entail the
concept of depth or centrality in each region. Second, mass is
a simple and straightforward measure, whereas data depth has
many different definitions, depending on the construct used to
define depth. The constructs could be convex hull, simplicial,
half-space⁶ and so on [19], all of which are expensive to
compute in multi-dimensional problems. Even when expressed
in the form of Half-Space Tree, the augmented mass is still
simple and straightforward and can be traced back to the
simple mass definition as shown in Section 4.2.

At the algorithm level, k-d Tree [17], which is based on the
median split, may appear to be similar to Half-Space Tree on the
surface. However, there are important differences. First, the
purpose of the algorithm is different: k-d Tree is designed to
speed up search, e.g., in a near neighbour search, whereas
Half-Space Tree is specifically designed for mass estimation
(the new ranking measure we propose here) for the purpose
of anomaly detection. Second, constructing a node of a k-d
Tree starts by searching for the median as the splitting point on
one dimension, and it cycles through the dimensions to build
subsequent nodes in the tree; in contrast, the splitting point
for a node of Half-Space Tree is simply the mid-point of a
randomly chosen dimension in the working space, independent
of the distribution of the data, so no search is required to find
the split point. Third, a k-d Tree cannot be used to estimate
mass because it is a balanced tree (due to the median split)
and all external nodes will have the same mass, which is useless
for our purpose here. Fourth, a k-d Tree is constructed using all
available data, whereas each Half-Space Tree only requires a
small training sample; e.g., only 0.045% or 256 out of more
than half a million instances in the Http data (reported in
Section 5) are required to build a well-performing Half-Space
Tree. Although both are linear time-complexity algorithms in
training, k-d Tree is linear with respect to the total training set
size $n$, whereas Half-Space Tree is linear with respect to
$\psi \ll n$.
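
The third difference can be illustrated on a small one-dimensional example. The sketch below is not a k-d Tree or Half-Space Tree implementation; it only contrasts the leaf masses produced by repeated median splits with those produced by repeated mid-point splits of a fixed range. The function names and the sample values are illustrative assumptions.

```python
def median_split_masses(values, depth):
    """Leaf masses after `depth` rounds of median splitting (k-d style)."""
    values = sorted(values)
    if depth == 0:
        return [len(values)]
    half = len(values) // 2
    return (median_split_masses(values[:half], depth - 1)
            + median_split_masses(values[half:], depth - 1))

def midpoint_split_masses(values, lo, hi, depth):
    """Leaf masses after `depth` rounds of mid-point splitting of [lo, hi)."""
    if depth == 0:
        return [len(values)]
    mid = (lo + hi) / 2.0
    left = [v for v in values if v < mid]
    right = [v for v in values if v >= mid]
    return (midpoint_split_masses(left, lo, mid, depth - 1)
            + midpoint_split_masses(right, mid, hi, depth - 1))

# Seven clustered values and one isolated value.
sample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 9.0]
median_leaves = median_split_masses(sample, 2)
midpoint_leaves = midpoint_split_masses(sample, 0.0, 10.0, 2)
```

With median splits every leaf holds two values, so mass cannot distinguish the isolated value; with mid-point splits the leaf masses are 7, 0, 0 and 1, and the isolated value sits alone in a low-mass region.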

A  method  has  been  proposed  to  use  k-d  Tree  for  anomaly 
detection  [12].  It  partitions  the  data  into  regions  of  uniform 
density  and  then  ranks  the  regions  according  to  their  densities. 
It  will  have  problems  detecting  clustered  anomalies  because 
of  the  use  of  density  measure  to  do  ranking. 

LOCI [26] is a density-based method that mainly uses
counts to compute its anomaly score, because density (=
number of instances per unit space) is equivalent to counts
when the space is the same for all density computations. For
each instance $p$, it first identifies a region defined by a (fixed,
user-defined) radius $r$ from $p$ and all its nearest neighbours
within the region. Then, it counts the number of instances
within a smaller circle of radius $\alpha r$, centred at each nearest
neighbour including $p$, where $\alpha < 1$. The anomaly score,
Multi-Granularity Deviation Factor (MDEF), a derivative of the
density ranking measure, is defined to be the relative difference
between the average count for all nearest neighbours ($\hat{n}$)
and the count for $p$ ($n$), i.e., $(\hat{n} - n)/\hat{n}$. A point which
is surrounded by points having the same density will have
MDEF = 0. Anomalies will have MDEF much larger than 0. It
is a more computationally intensive approach than Half-Space
Tree as it requires distance calculations to define the regions.
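
The MDEF computation described above can be sketched as follows. This follows only the description in this section, not the full LOCI algorithm of [26]; the function name `mdef`, the brute-force distance computation and the sample points are assumptions made for illustration.

```python
import math

def mdef(points, p, r, alpha):
    """Relative difference between the average neighbour count and p's count.

    points: list of 2-D tuples that includes p; r: region radius;
    alpha: fraction (< 1) giving the smaller counting radius alpha * r.
    """
    def count_within(centre, radius):
        return sum(1 for q in points if math.dist(centre, q) <= radius)

    # Nearest neighbours of p within radius r (p itself is included).
    neighbours = [q for q in points if math.dist(p, q) <= r]
    counts = [count_within(q, alpha * r) for q in neighbours]
    n_hat = sum(counts) / len(counts)      # average count over neighbours
    n = count_within(p, alpha * r)         # count for p itself
    return (n_hat - n) / n_hat

# Four clustered points and one isolated point (illustrative data).
cluster = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.1, 0.1)]
isolated = (1.0, 0.0)
all_points = cluster + [isolated]
```

With $r = 1.5$ and $\alpha = 0.2$, the isolated point has a count of 1 against an average of 3.4, giving an MDEF of about 0.71, whereas each clustered point has a count above the average and therefore a negative MDEF. The nested distance computations also show why this approach costs more than a tree traversal.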

Tietjen and Moore [33] describe the ‘masking effect’ of
clustered anomalies as follows:

“Suspected observations sometimes form subgroups; i.e.,
several values are closer to each other than they are to the
bulk of the observations. The masking effect is the inability
of a testing procedure to identify even a single outlier in the
presence of several suspected values.”

6. The term ‘half-space’ has been used in geometry to denote either part
of the space divided by a hyperplane; the two parts are not required to have
equal-size space, unlike the ones used in Half-Space Tree.

Statistical tests are all affected by the masking effect to
different degrees [e.g., 7, 27]. For example, in the flow rate
data set, which has four anomaly clusters making up 20% of
the total 2589 hourly average flow rate measurements from
an industrial manufacturing process, Pearson [27, Chapter 3]
shows that the Hampel identifier [14] is a better anomaly
detector than the extreme studentised deviation identifier [14]; yet
in another data set which has an asymmetrical distribution, the
Hampel identifier misses a scattered anomaly while the
asymmetrical boxplot is able to detect it together with other
anomalies. All of these statistical tests are designed for single
or low dimensional problems only [11].

Note that the ‘collective outliers’ referred to in [11] are
different from clustered anomalies. Collective outliers occur in
data where individual instances are related, e.g., in sequence,
spatial, graph and time series data. The individual instances
in collective outliers may not be anomalies by themselves,
whereas every individual instance in clustered anomalies is an
anomaly, without exception.

9  Concluding  remarks 

This  paper  introduces  a  ranking  measure,  mass,  for  anomaly 
detection.  This  measure  is  simple,  straightforward  and  fast  to 
compute.  It  is  the  only  basic  ranking  measure  that  we  know 
which  ranks  both  scattered  and  clustered  anomalies  correctly 
for  anomaly  detection  tasks. 

We have identified the key weakness of the two commonly
used measures, distance and density: as ranking measures,
they both fail to rank clustered anomalies correctly and are thus
unable to detect this kind of anomaly. It is well known
that both of these measures are computationally expensive.
Existing anomaly detection methods based on them require
extensive search and pruning heuristics in order to speed up
the run time (e.g., [3, 8]).

Our analysis using Half-Space Tree shows that mass is an
effective ranking measure for both scattered anomalies and
clustered anomalies, and the depth level of the tree is a proxy
to mass. This simple method is shown to perform better than
state-of-the-art distance-based anomaly detectors ORCA and
SVM, and density-based anomaly detector LOF, in terms of
anomaly detection accuracy. Its time and space complexities
are also significantly better, with constant training time and
memory space requirement.

We  reveal  that  a  previous  method  iForest  is  a  mass-based 
approach  which  employs  the  depth  level  as  a  proxy  to  mass. 
This  uncovers  the  principle  underpinning  the  method,  which 
was  previously  unknown. 

This paper identifies the source of failure for existing
methods to detect clustered anomalies: the use of density or
distance as the ranking measure. There have been a few attempts
to mitigate the problem, with limited success, e.g., modifying
the density measure [32] or alternatively using a clustering
algorithm to identify the clustered anomalies. We show that by
using the mass measure and the proposed HS*-Trees, clustered
anomalies can be identified effectively and efficiently without
resorting to the use of a clustering algorithm.
Appendix A: Anomaly detection using clustering

Here  we  demonstrate  that  one  can  use  a  clustering  algorithm 
to  carve  out  the  regions  in  the  feature  space,  and  then  use  any 
one  of  the  three  ranking  measures  to  do  ranking  for  anomaly 
detection.  We  employ  a  commonly  used  clustering  algorithm 
k-Means  [24]  to  carve  out  a  pre-set  number  of  clusters  from 
data. 

We perform the same experiment as conducted in Section
5.1. We compare the ranking performance of each
of the three ranking measures, applied to the k-Means clustering
result, in the synthetic data set (with increasing densities for
the anomaly cluster).


Fig. 10. A comparison of AUC performance of k-means
(k=5) using either mass, kth NN distance or kNN density.

Figure 10 shows the result comparing the three measures
using the synthetic data set. It shows the same relative
performance between mass and the other two ranking measures
as we have seen in Section 5.1: mass is a better ranking
measure regardless of whether HS*-Trees or k-means is used to
carve out the regions.
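
The idea of ranking by mass after clustering can be sketched as follows. How the experiment applies mass to the k-Means regions is not detailed in this appendix, so the sketch makes a simple assumption: the mass of a region is the number of instances assigned to that cluster, and every instance inherits the mass of its cluster as its score. The plain Lloyd-style `kmeans` with caller-supplied initial centroids and the function names are also illustrative.

```python
from collections import Counter

def kmeans(points, initial_centroids, iterations=10):
    """Plain Lloyd's algorithm; returns the cluster index of each point."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def mass_scores(labels):
    """Score each instance by the mass (size) of its cluster (assumption)."""
    mass = Counter(labels)
    return [mass[l] for l in labels]

example_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
                  (-0.1, 0.0), (0.0, -0.1), (10.0, 10.0), (10.1, 10.0)]
example_labels = kmeans(example_points, [(0.0, 0.0), (10.0, 10.0)])
example_scores = mass_scores(example_labels)
```

In this toy example the two points in the small, tight cluster receive the lowest mass and are therefore ranked as the most anomalous, however dense that cluster is.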


Appendix  B:  HS*-Trees  versus  iForest 

This  section  compares  the  anomaly  detection  performance 
between  HS*-Trees  and  iForest  [20].  Table  6  shows  that  HS*- 
Trees  performs  better  than  iForest  in  6  data  sets,  draws  in  4, 
and  loses  only  in  2  data  sets,  in  terms  of  AUC. 


               HS*-Trees   iForest
Http           1.00        1.00
ForestCover    0.89        0.88
Mulcross       0.99        0.97
Smtp           0.90        0.88
Shuttle        1.00        1.00
Mammography    0.86        0.86
Annthyroid     0.73        0.82
Satellite      0.74        0.71
Pima           0.69        0.67
Breastw        0.99        0.99
Arrhythmia     0.84        0.80
Ionosphere     0.80        0.85

TABLE 6

AUC results comparing HS*-Trees and iForest. Figures
boldfaced are the best performance for each data set.
Abstract

The data stream problem has received a lot of attention in recent
years. Data streams are potentially high speed and infinite,
demanding efficient algorithms that require only one pass over the
data. Furthermore, a lot of stream data can evolve over time, thus
data stream algorithms must be able to adapt to changes that occur
in the input. In this paper, we propose a new, adaptive, data stream
anomaly detection method, called Streaming HS-Trees. The method
features a random tree model well integrated with a change detector,
so that the model can adapt to changing input. We also identify the
specific type of change in data distribution that requires model
update in order to maintain high detection accuracy. The proposed
model can be constructed without any data instances and hence can be
installed before the arrival of a data stream. It is highly efficient
because it requires no model restructuring when adapting to a new
data distribution. Our analysis shows that Streaming HS-Trees has an
amortised constant time complexity and a constant memory requirement,
independent of data size. Our performance study demonstrates the
benefits of developing an anomaly detection model that can adapt to
data distribution changes. When compared with two existing outlier
detection methods, our method performs favourably in terms of
both detection accuracy and runtime performance.

1  Introduction 

Data streams are commonly found in modern data acquisition
systems. Sensor networks, for example, generate vast amounts
of data that must be analysed in real time. Data streams are
potentially infinite. Any off-line learning algorithm that
attempts to store all the data for analysis will run out of
memory space at some point, regardless of the capacity of
the available memory space.

In addition, no single static model can accurately analyse
an entire data stream that evolves and experiences changes
in data distribution over time. Instead, the model needs to
adapt to different parts of the data stream.

A data stream algorithm thus needs to satisfy two
constraints. First, it must be a one-pass algorithm, i.e., it
inspects each data point only once and discards it before
the next is processed. In a one-pass algorithm, the memory
requirement never grows, and it can analyse a potentially
endless amount of data. Second, it must incorporate change
detection and model update mechanisms into the method, in
order to deal with time-varying data distribution.

This  paper  proposes  a  one-pass  anomaly  detector  for 
evolving  data  streams.  The  proposed  method  is  called 
Streaming  Half-Space  Trees  (HS-Trees). 

Streaming  HS-Trees  has  the  following  characteristics: 

i)  It  is  a  one-pass  anomaly  detection  algorithm  that  can  be 
used  to  analyse  massive  datasets  or  data  streams. 

ii)  Unlike  traditional  model-based  algorithms,  a  model  in 
Streaming  HS-Trees  can  be  built  without  any  data; 
hence  it  is  possible  to  create  HS-Trees  before  the  arrival 
of  a  data  stream. 

iii) Each model stores the profile of a data stream in different
time windows. As will be shown later, the profiles of
two windows can be compared easily, offering a simple
way to detect distribution changes in a data stream.

iv)  When  there  is  a  need  to  update  a  model,  Streaming  HS- 
Trees  does  so  by  simply  using  the  latest  profile  in  the 
latest  time  window;  this  avoids  the  need  and  the  cost  to 
modify  its  model  structure. 

v)  It  has  the  ability  to  deal  with  anomaly  detection  and  data 
distribution  change  within  a  single  framework. 

The  rest  of  this  paper  is  organized  as  follows.  Section 
2  states  the  problem  and  the  goal  of  this  paper.  Section  3 
presents  the  proposed  Streaming  HS-Trees  method.  Section 
4  describes  the  experimental  setup  and  Section  5  discusses 
the  experimental  results.  Section  6  provides  a  discussion. 
Finally,  we  conclude  this  paper  in  Section  7. 

2  Problem  Statement  and  Goal 

When a data stream arrives continuously at high speed, it is
impractical to use traditional off-line learning methods that
store all the data for analysis. Instead, an on-line, one-pass
anomaly detector is required to address this problem.

The underlying profile in a stream may change over
time, causing the detection accuracy of any non-adaptive
anomaly detector to degrade. However, only a certain type of
change demands a model update; no model update is required
for the others. Thus, we need a change detector that is
able to detect the ‘right’ type of change and then trigger the
anomaly detector to adapt to the new profile.

A data distribution can be expressed as P(x, y), which
has two components: P(x|y) and P(y), where x is the
input and y is the ‘class’. In the anomaly detection context,
y ∈ {normal, anomaly}. We describe the different types
of change in data distribution as follows:

(I) Change due to normal points only: P(x|y = normal)

(II) Change due to anomalies only: P(x|y = anomaly)

(III) Change in proportion of anomalies/normal points only:
P(y), especially an increased number of anomalies or
contamination level: P(y = anomaly)

In practice, a change may involve any combination of
the above: P(x, y) = P(x|y)P(y).
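To make the three types of change concrete, the sketch below samples a labelled one-dimensional stream from P(x, y) = P(x|y)P(y). The Gaussian class-conditional distributions, the parameter names and the function `sample_stream` are assumptions made purely for illustration; they are not the synthetic data used later in this paper.

```python
import random


def sample_stream(n, normal_mean, anomaly_mean, anomaly_rate, seed=0):
    """Draw n labelled 1-D points from P(x, y) = P(x|y) P(y).

    normal_mean  : shifting it gives a Type (I) change,   P(x|y = normal)
    anomaly_mean : shifting it gives a Type (II) change,  P(x|y = anomaly)
    anomaly_rate : changing it gives a Type (III) change, P(y = anomaly)
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        y = "anomaly" if rng.random() < anomaly_rate else "normal"
        mean = anomaly_mean if y == "anomaly" else normal_mean
        points.append((rng.gauss(mean, 1.0), y))
    return points


# A reference segment, followed by segments with one type of change each.
before = sample_stream(1000, normal_mean=0.0, anomaly_mean=8.0, anomaly_rate=0.05)
type_1 = sample_stream(1000, normal_mean=5.0, anomaly_mean=8.0, anomaly_rate=0.05)
type_2 = sample_stream(1000, normal_mean=0.0, anomaly_mean=12.0, anomaly_rate=0.05)
type_3 = sample_stream(1000, normal_mean=0.0, anomaly_mean=8.0, anomaly_rate=0.15)
```

Only the `type_1` segment moves the normal points, so under the argument that follows it is the only one of the three that should lead to a model update.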

A model shall be updated for any combination of
changes involving Type (I) change. However, it is imperative
not to update a model when the change is only of Type (II)
or Type (III). Type (I) and Type (II) changes are collectively
referred to as concept change in the literature. We make a
distinction here because a change in P(x|y = anomaly)
does not affect the detection performance of an anomaly
detector if it profiles normal points. In fact, any attempt to
update the model in this situation will result in poor
detection accuracy.

Problem statement. We address three inter-related
problems in evolving data streams. The first problem is to
maintain a high anomaly detection accuracy at all times in
evolving data streams. This requires an anomaly detector
to (i) have an ability to detect change in data distribution
and (ii) effect a timely model update in order to adapt to the
changing data distribution. As mentioned earlier, not all
changes in data distribution warrant a model update. Thus,
the second problem is to devise a change detection
mechanism which will detect any change due to
P(x|y = normal) only, and ignore all other changes. The
third problem is to institute a model update only when a
persistent change in P(x|y = normal) is detected, and to
perform no model update if the change is found to be transient.

Goal.  Our  goal  in  this  paper  is  to  address  these  three 
problems  in  a  single  framework.  We  aim  to: 

•  introduce an algorithm that has an amortised O(1) time
complexity, called Streaming HS-Trees,

•  demonstrate  that  Streaming  HS-Trees  can  deal  with 
anomaly  detection,  change  detection,  and  adapting  to 
data  distribution  change,  all  within  a  single  framework. 

3  The  Proposed  Method 

The proposed method is a tree-based ensemble approach.
The model is in the form of Half-Space Trees, or HS-Trees.
We describe the proposed method in the following four
sections. Section 3.1 provides the definitions and algorithms
to construct an HS-Tree. Section 3.2 describes the proposed
one-pass algorithm, Streaming HS-Trees. Section 3.3
provides an analysis of the time and space complexities of
Streaming HS-Trees. Section 3.4 presents the change
detection and model update mechanisms.

The key symbols and notations used in this paper are
listed in Table 1.


Symbol   Description
x        a streaming point
n        the number of streaming points
T        a Half-Space Tree, HS-Tree
N        a node in an HS-Tree, or Node
t        the number of HS-Trees in an ensemble
h        maximum depth level of a tree, or maxDepth
min      an array of minimum values for all dimensions
min_q    minimum value of dimension q
max      an array of maximum values for all dimensions
max_q    maximum value of dimension q
r        mass* of a node in the reference window
l        mass of a node in the latest window
ψ        window size
A        number of consecutive windows before a model is updated
s        an anomaly score

Table 1: Key symbols and notations (*mass is the number
of training instances which traverse through a node in an
HS-Tree.)


3.1  Half-Space Trees. A Half-Space Tree is a balanced
binary tree in which each internal node splits a feature
(sub)space into two half-sized subspaces, and all external
nodes have the same depth. A Half-Space Tree is built from
a working space that is established by defining a range in
every dimension. The tree building process begins by
randomly selecting a dimension and then bisecting the working
space at the mid-point of the selected dimension. This
process continues recursively for each newly created node
until it reaches the required maximum depth level, denoted
as h or maxDepth.

Creating diverse HS-Trees is crucial to the success of
our method. This is achieved by invoking the procedure
InitialiseWorkingSpace (Algorithm 1) to obtain a new
working space, right before the construction of each tree. The
purpose of Algorithm 1 is to redefine the range of each
dimension in the feature space such that each dimension has a
new range. Since each tree is built from a variant of the
original space, the result is an ensemble of diverse HS-Trees.

Algorithm 1 : InitialiseWorkingSpace(Dmin, Dmax)
Inputs: Dmin & Dmax - arrays of given minimum and
maximum values for every dimension
Output: min & max - arrays of redefined minimum and
maximum values for every dimension in a Working Space

1: for each dimension q do
2:   Randomly choose s from [Dmin_q, Dmax_q]
3:   σ ← 2 · max(s − Dmin_q, Dmax_q − s)
4:   min_q ← s − σ
5:   max_q ← s + σ
6: end for
7: return min and max


Algorithm 2 shows the procedure for building a single
HS-Tree. Each internal node is formed by randomly selecting
a dimension (see line 4) to form two half-spaces; the split
point is the mid-point of the current range of the selected
dimension. As a result, the entire tree construction procedure
is very fast because the process requires no evaluation
criteria for dimension or split point selections.

Each node has two mass variables, r and l, which record
the number of training instances traversing through it at
different time windows of a data stream. Each of these
variables is assigned an initial value of zero during the initial
tree construction process.

Algorithm 2 : BuildSingleHS-Tree(min, max, k)
Inputs: min & max - arrays of minimum and maximum
values for every dimension in a Working Space, k - current
depth level
Output: an HS-Tree

1: if k == maxDepth then
2:   return Node(r ← 0, l ← 0) {External node}
3: else
4:   randomly select a dimension q
5:   p ← (max_q + min_q)/2
6:   {Build two nodes: Left and Right as a result of a
     split into two equal-volume half-spaces.}
7:   temp ← max_q; max_q ← p
8:   Left ← BuildSingleHS-Tree(min, max, k + 1)
9:   max_q ← temp; min_q ← p
10:  Right ← BuildSingleHS-Tree(min, max, k + 1)
11:  return Node(Left, Right, SplitAtt ← q,
       SplitValue ← p, r ← 0, l ← 0)
12: end if
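Algorithms 1 and 2 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the `Node` dataclass, the function names, the explicit `max_depth` and `rng` arguments, and the helper `count_nodes` are all assumptions introduced here. The sketch also copies the range arrays for each child instead of using the temp/restore steps of lines 7 and 9.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    level: int
    r: int = 0                          # mass in the reference window
    l: int = 0                          # mass in the latest window
    split_att: Optional[int] = None     # None for external nodes
    split_value: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def initialise_working_space(d_min, d_max, rng):
    """Algorithm 1: redefine the range of every dimension."""
    lo, hi = [], []
    for q in range(len(d_min)):
        s = rng.uniform(d_min[q], d_max[q])
        sigma = 2 * max(s - d_min[q], d_max[q] - s)
        lo.append(s - sigma)
        hi.append(s + sigma)
    return lo, hi


def build_single_hs_tree(lo, hi, k, max_depth, rng):
    """Algorithm 2: bisect a randomly chosen dimension at every level."""
    if k == max_depth:
        return Node(level=k)            # external node
    q = rng.randrange(len(lo))
    p = (lo[q] + hi[q]) / 2
    left_hi = hi[:q] + [p] + hi[q + 1:]     # left half-space: max_q <- p
    right_lo = lo[:q] + [p] + lo[q + 1:]    # right half-space: min_q <- p
    left = build_single_hs_tree(lo, left_hi, k + 1, max_depth, rng)
    right = build_single_hs_tree(right_lo, hi, k + 1, max_depth, rng)
    return Node(level=k, split_att=q, split_value=p, left=left, right=right)


def count_nodes(node):
    """Number of nodes in the subtree rooted at node."""
    if node is None:
        return 0
    return 1 + count_nodes(node.left) + count_nodes(node.right)


rng = random.Random(42)
lo, hi = initialise_working_space([0.0, 0.0], [1.0, 1.0], rng)
tree = build_single_hs_tree(lo, hi, 0, 3, rng)
print(count_nodes(tree))  # 15 nodes for a balanced tree with maxDepth = 3
```

Note that no data instance is involved at any point: only the per-dimension ranges are needed, and the working space produced by Algorithm 1 always contains the given range.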


Recording mass profile in HS-Tree. Once an HS-Tree is
constructed, it needs to build a mass profile of the data before
it can be employed for anomaly detection. The process
involves traversing every training instance through the
HS-Tree. The mass profile of a node is simply the number of
training instances traversing through that node. Algorithm 3
shows that training instances in the reference time window
update mass r; otherwise mass l (in the latest time
window) is updated. The variables r and l are used in
Streaming HS-Trees, which will be described in Section 3.2.


Algorithm 3 : UpdateMass(x, Node, referenceWindow)
Inputs: x - an instance, Node - a node in an HS-Tree
Output: none

1: (referenceWindow) ? Node.r++ : Node.l++
2: if (Node.Level < maxDepth) then
3:   Let Node' be the sub-node of Node that x traverses
4:   UpdateMass(x, Node', referenceWindow)
5: end if


Scoring. After a mass profile is recorded in an HS-Tree,
it is ready to assign anomaly scores to test instances. Given a
test instance x and an ensemble of HS-Trees, the anomaly
score is the sum of Score(x, T.root) (see Algorithm 4)
obtained from every Half-Space Tree T.

In Algorithm 4, sizeLimit is a global parameter that
ensures that a score is obtained from a node that has a
reasonable number of instances. maxDepth is also a global
parameter; it ensures that a tree is grown to the specified
maximum depth. For our purpose, we find that sizeLimit =
20 and maxDepth = 15 give reasonable results over a range
of real and synthetic datasets.


Algorithm 4 : Score(x, Node)
Inputs: x - a test instance, Node - a node in an HS-Tree
Output: an anomaly score for x

1: if (Node.Level == maxDepth) ∨ (Node.r < sizeLimit) then
2:   return Node.r × 2^Node.Level
3: else
4:   Let Node' be the sub-node of Node that x traverses
5:   return Score(x, Node')
6: end if
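The mass recording and scoring steps of Algorithms 3 and 4 can be sketched as follows. The `Node` dataclass, the helper `child_for`, and the explicit `max_depth` and `size_limit` arguments are assumptions of this sketch. It also assumes that the terminating node returns its reference mass scaled by 2 raised to its level, and that points below the split value go to the left sub-node; both are readings of the method rather than details confirmed by the listing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    level: int
    r: int = 0                          # mass in the reference window
    l: int = 0                          # mass in the latest window
    split_att: Optional[int] = None
    split_value: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def child_for(node, x):
    """Return the sub-node of `node` that x traverses (assumed: < goes left)."""
    return node.left if x[node.split_att] < node.split_value else node.right


def update_mass(x, node, reference_window, max_depth):
    """Algorithm 3: add x to mass r or mass l of every node on its path."""
    while True:
        if reference_window:
            node.r += 1
        else:
            node.l += 1
        if node.level >= max_depth:
            return
        node = child_for(node, x)


def score(x, node, max_depth, size_limit):
    """Algorithm 4: stop at maxDepth or at a node with r < sizeLimit."""
    while node.level < max_depth and node.r >= size_limit:
        node = child_for(node, x)
    return node.r * 2 ** node.level     # assumed depth-adjusted mass


# A hand-built tree of depth 1 over one dimension, split at 0.5.
root = Node(level=0, split_att=0, split_value=0.5,
            left=Node(level=1), right=Node(level=1))
for v in [0.1, 0.2, 0.3, 0.9]:
    update_mass([v], root, True, max_depth=1)

print(score([0.15], root, max_depth=1, size_limit=2))  # 6: left leaf, r = 3
print(score([0.95], root, max_depth=1, size_limit=2))  # 2: right leaf, r = 1
```

In this toy example the point falling in the sparsely populated right half-space receives the lower score, which corresponds to being ranked as more anomalous.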


3.2  Streaming HS-Trees. One interesting aspect of the
proposed streaming method is that it requires no data to build
a model; it only requires the range of each dimension in
the feature space. This is materially different from many
traditional models, which demand training data to construct
their models. As a result, a model can be constructed with
Streaming HS-Trees even before the actual data arrive, as
long as the range of each dimension is known (or estimated)
a priori. Furthermore, traditional algorithms must update
their model structures continuously in order to keep up with
newly arrived data. In contrast, Streaming HS-Trees builds a
model structure that does not need to be changed to keep up
with newly arrived data.

Algorithm 5 shows the operational procedure for
Streaming HS-Trees. Line 1 builds an ensemble of Half-Space
Trees. Line 2 uses the first ψ instances of the stream
to train the HS-Trees. Since these instances come from the
initial reference time window, only mass r of each traversed
node is updated. After these two steps, the model is ready
to provide an anomaly score for each subsequent streaming
point.

Streaming  HS-Trees  is  designed  to  process  the  data  in  a 
single  pass.  When  a  data  point  arrives,  it  is  traversed  from 
the  root  to  a  (terminating)  node  of  each  tree.  The  point  is 
then  discarded  before  the  next  data  point  is  processed.  This 
enables  the  model  to  deal  with  an  incoming  data  stream  by 
examining  every  data  point  only  once.  Hence,  the  proposed 
method  only  requires  a  finite  memory  to  process  an  infinitely 
long  data  stream. 

Each node in an HS-Tree has two variables: reference
mass r and latest mass l, which respectively store the mass
profiles of a reference window and the latest window. Line 2
of the algorithm records the mass profile in the reference mass
r. This mass is always used to compute the anomaly score
for each streaming point (line 9). The recording of mass
for each subsequent streaming point in the latest window is
carried out on mass l (line 8). This latest mass profile is used
by the change detector to decide if a change (with respect
to mass r in the reference window) has occurred. The mass l
of each node with a non-zero value is reset at the end of each
window (line 19).

In Streaming HS-Trees, the model is updated only after
a (persistent) distribution change is detected over A
consecutive windows (line 15). This is to avoid model update over
a transient distribution change. Model update is surprisingly
simple and no structural change of the model is required.
The model is updated to the latest mass before the start of
the next window by simply transferring the non-zero mass l
to r (line 16).
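The window logic of Algorithm 5 can be sketched independently of the tree model. In the sketch below, `stream_scores` drives any object that exposes the five operations; the method names `update_mass`, `score`, `change_detected`, `update_model` and `reset_latest` are hypothetical, and `TwoBinModel` is a deliberately trivial stand-in for an HS-Tree ensemble, used only to make the example runnable. The update condition is written as "at least A consecutive windows", following the description above.

```python
from itertools import islice


def stream_scores(stream, model, psi, A):
    """Yield one anomaly score per streaming point, following Algorithm 5."""
    it = iter(stream)
    for x in islice(it, psi):            # line 2: reference mass profile
        model.update_mass(x, True)
    count = changes = 0
    for x in it:                         # lines 4-22
        model.update_mass(x, False)      # line 8: update mass l
        yield model.score(x)             # lines 9 and 11
        count += 1
        if count == psi:                 # end of the latest window
            changes = changes + 1 if model.change_detected() else 0
            if changes >= A:             # persistent change
                model.update_model()     # line 16: r <- l
                changes = 0
            model.reset_latest()         # line 19: l <- 0
            count = 0


class TwoBinModel:
    """Illustrative stand-in: two bins over [0, 1), split at 0.5."""

    def __init__(self):
        self.r, self.l, self.updates = [0, 0], [0, 0], 0

    def update_mass(self, x, reference):
        (self.r if reference else self.l)[x >= 0.5] += 1

    def score(self, x):
        return self.r[x >= 0.5]

    def change_detected(self):
        return self.r != self.l

    def update_model(self):
        self.r, self.updates = list(self.l), self.updates + 1

    def reset_latest(self):
        self.l = [0, 0]


model = TwoBinModel()
stream = [0.1] * 4 + [0.9] * 12          # the data moves to the other bin
print(list(stream_scores(stream, model, psi=4, A=2)))
# [0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4]
```

The scores stay at zero for two full windows after the shift, because the model is only updated once the change has persisted for A = 2 windows; each point is seen once and then discarded.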

Algorithm 5 : Streaming HS-Trees(ψ, t)
Inputs: ψ - Window Size, t - number of HS-Trees
Output: s - anomaly score for each streaming point

1: Build t HS-Trees : call Algorithms 1 & 2 for each tree
2: Record a reference mass profile in HS-Trees: for each
   tree T, invoke UpdateMass(x, T.root, true) for each
   item x in the first ψ instances of the stream
3: Count ← 0
4: while data stream continues do
5:   Receive the next streaming point x
6:   s ← 0
7:   for each tree T in HS-Trees do
8:     UpdateMass(x, T.root, false) {update mass l in T}
9:     s ← s + Score(x, T.root) {accumulate scores}
10:  end for
11:  Report s as the anomaly score for x
12:  Count++
13:  if Count == ψ then
14:    ChangeDetected ? ppChange++ : ppChange ← 0
15:    if ppChange ≥ A then
16:      Update model : N.r ← N.l for each non-zero
         mass node N
17:      ppChange ← 0
18:    end if
19:    Reset N.l ← 0 for all non-zero mass nodes N
20:    Count ← 0
21:  end if
22: end while


3.3  Time and space complexities. Here, we analyse the
amortised time complexity of Streaming HS-Trees. For n
streaming points, where n > ψ, there are n predictions, n/ψ
change detections, and at most n/(ψA) model updates. There
are five key operations in the main loop of Algorithm 5, as
listed in Table 2.

The first operation is to update the mass variable l in all
nodes along a path from the root of a tree to the maximum
depth of h; this is required for all t trees. The second
operation is to provide a score, which takes O(h) for each tree,
and a sum of scores over t trees. These two operations are
carried out for each streaming point. The third operation
is to detect change (the details are provided in Section 3.4),
which needs to access a maximum of ψ nodes in each tree;
this operation is performed at the end of each window.
The fourth operation is to update the model to the latest
mass profile, which takes O(ψ) assignments r ← l
for all non-zero mass nodes. This operation only needs to be
done at most n/(ψA) times. The fifth operation is to reset N.l
for all nodes N with non-zero mass. This operation involves a
maximum of ψ non-zero nodes; it is done for every tree and
needs to be carried out at the end of each window.

In summary, model update has the smallest cost among
the five operations, while update mass l and Score(x, T.root)
have the largest cost.

Thus, for Streaming HS-Trees, the average time cost of
each operation in the worst case for n streaming points is

\[ \frac{T(n)}{n} = O\left(t\left(h + 1 + \frac{1}{2A}\right)\right). \]

Note that the above amortised cost is independent of
data size n, the window size ψ and the number of
dimensions. Since the parameters t, h and A are algorithmic
parameters independent of data size, our proposed algorithm
Streaming HS-Trees is an amortised O(1) algorithm.

The space complexity for HS-Trees is O(t·2^h), which is a
constant for an ensemble with fixed maximum depth level
(h) and ensemble size (t). Note that this is also independent
of data size.


Operation (Op)          Line#   Cost/Op   #Op
1. Update mass l        8       th        n
2. Score(x, T.root)     9       th        n
3. Change detection     14      tψ        n/ψ
4. Update model         16      tψ        n/(ψA)
5. Reset N.l ← 0        19      tψ        n/ψ

Total cost, T(n) = O(2nt(h + 1 + 1/(2A)))

Table 2: Amortised Analysis for Streaming HS-Trees: Total
time cost for n streaming points. Line# refers to the line
number in Algorithm 5.
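The total in Table 2 can be verified by summing Cost/Op × #Op over the five rows. The sketch below does this with exact rational arithmetic, assuming operation counts of n, n, n/ψ, n/(ψA) and n/ψ for the five rows; the function names and the parameter values t = 25 and ψ = 250 are arbitrary illustrative choices.

```python
from fractions import Fraction


def total_cost(n, t, h, psi, A):
    """Sum Cost/Op x #Op over the five operations of Table 2."""
    rows = [
        (t * h, Fraction(n)),               # 1. update mass l
        (t * h, Fraction(n)),               # 2. Score(x, T.root)
        (t * psi, Fraction(n, psi)),        # 3. change detection
        (t * psi, Fraction(n, psi * A)),    # 4. update model (upper bound)
        (t * psi, Fraction(n, psi)),        # 5. reset N.l
    ]
    return sum(cost * ops for cost, ops in rows)


def closed_form(n, t, h, A):
    """The total 2nt(h + 1 + 1/(2A)) stated in Table 2."""
    return 2 * n * t * (h + 1 + Fraction(1, 2 * A))


n, t, h, psi, A = 10_000, 25, 15, 250, 4
assert total_cost(n, t, h, psi, A) == closed_form(n, t, h, A)
print(total_cost(n, t, h, psi, A) / n)  # 3225/4, i.e. 806.25 per point
```

The window size ψ cancels in rows 3 to 5, which is why the per-point cost depends only on t, h and A.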


3.4  Change detection. The idea is to capture significant
changes in mass within a subspace defined by a branch in an
HS-Tree. Consider a subspace which has a high mass in a
reference window. If this subspace has a significant change
in mass in the latest window (compared to the reference
window), then a significant change in normal points has
occurred. This is because a high-mass subspace is normally
associated with normal points. Changes in the high-mass
subspaces indicate changes in distribution due to the normal
points (i.e., P(x|y = normal)). When this occurs, the
model shall be updated to the latest mass profile. We give
a more precise definition of our change detection method
below.

Let L be the set of nodes with non-zero (reference or
latest) mass in the HS-Trees; that is:

\[ L = \{ \mathit{Node} : (\mathit{Node}.r > 0) \lor (\mathit{Node}.l > 0) \} \]

The mean of all non-zero mass Node.r of HS-Trees is
given as:

\[ \bar{r} = \frac{1}{|L|} \sum_{\mathit{Node} \in L} \mathit{Node}.r \]

The set of high-mass nodes is defined as:

\[ L_{high} = \{ \mathit{Node} : (\mathit{Node} \in L) \land (\mathit{Node}.r > \bar{r}) \} \]

L_high contains the mass profile of a set of subspaces
with high mass. Examining the changes of this profile allows
us to describe the changes which have taken place in a data
distribution.

The percentage of change in the high-mass profile of a
latest window with respect to that of a reference time window
is given as follows:

\[ d = \sum_{\mathit{Node} \in L_{high}} \frac{|\mathit{Node}.r - \mathit{Node}.l|}{\mathit{Node}.r} \]
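The definitions above can be turned into a short computation over the (r, l) mass pairs of all nodes. This is a sketch under stated assumptions: the change is taken to be the sum of relative differences over the high-mass nodes, and any further normalisation used by the authors is not reproduced here. The function name `high_mass_change` and the list-of-pairs input are illustrative.

```python
def high_mass_change(nodes):
    """Change in the high-mass profile for an iterable of (r, l) mass pairs.

    Assumes the change is the sum of |r - l| / r over high-mass nodes.
    """
    # L: nodes with non-zero reference or latest mass.
    L = [(r, l) for r, l in nodes if r > 0 or l > 0]
    if not L:
        return 0.0
    # Mean reference mass over L.
    r_bar = sum(r for r, _ in L) / len(L)
    # L_high: nodes whose reference mass exceeds the mean (so r > 0 here).
    L_high = [(r, l) for r, l in L if r > r_bar]
    return sum(abs(r - l) / r for r, l in L_high)


# Two high-mass nodes (100 and 80) and three low-mass or empty nodes.
nodes = [(100, 90), (80, 20), (5, 0), (0, 3), (0, 0)]
print(high_mass_change(nodes))  # about 0.85 = 10/100 + 60/80
```

Only the first two nodes exceed the mean reference mass of 46.25, so the low-mass nodes, where anomalies are expected to fall, do not contribute to the measured change.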

3.4.1  Change detection for model update. We now define
the required amount of change in the high-mass profile
to be considered as ‘large enough’ to update a model. Let
$d_\omega$ be the percentage of change in the high-mass profile,
where $\omega$ is the index of each time window.

We estimate the average of d using an exponential
update rule:

\[ \bar{d}_{\omega+1} = \alpha \cdot d_\omega + (1 - \alpha) \cdot \bar{d}_\omega \]

Here, $\alpha$ is the smoothing constant in the range of
[0, 1], and a higher $\alpha$-value gives more weight to recent
observations.

The average deviation $\delta$ from the average of d is defined
similarly:

\[ \delta_{\omega+1} = \alpha \cdot |d_\omega - \bar{d}_\omega| + (1 - \alpha) \cdot \delta_\omega \]

Let $d_{\omega+m}$ be a change that occurs m windows after
its reference time window $\omega$. This change is
considered large if

\[ d_{\omega+m} > \bar{d}_\omega + \tau \cdot \delta_\omega \]

where $m \geq 1$ and $\tau$ defines the amount of deviation that is
considered to be ‘large enough’.

The model is only updated if changes are detected in A
consecutive windows. Based on this, a change is categorised
as:

•  a transient change if m < A;

•  a persistent change if m ≥ A.
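The thresholding rule and the transient/persistent distinction can be sketched as a small stateful detector. The class name `ChangeDetector`, the default parameter values, and two design choices are assumptions of this sketch: the first window only initialises the running average, and the running statistics are frozen while a large change is in progress so that later windows are compared against the same reference values.

```python
class ChangeDetector:
    """Exponentially smoothed threshold on the per-window change d."""

    def __init__(self, alpha=0.5, tau=2.0, A=3):
        self.alpha, self.tau, self.A = alpha, tau, A
        self.d_bar = None      # running average of d
        self.delta = 0.0       # running average deviation from d_bar
        self.m = 0             # consecutive windows with a large change

    def observe(self, d):
        """Feed d for one window; return 'none', 'transient' or 'persistent'."""
        if self.d_bar is None:             # first window: initialise only
            self.d_bar = d
            return "none"
        if d > self.d_bar + self.tau * self.delta:
            self.m += 1
            if self.m >= self.A:           # change held for A windows
                self.m = 0
                return "persistent"
            return "transient"
        # No large change: reset the counter and refresh the statistics.
        self.m = 0
        self.delta = self.alpha * abs(d - self.d_bar) + (1 - self.alpha) * self.delta
        self.d_bar = self.alpha * d + (1 - self.alpha) * self.d_bar
        return "none"


detector = ChangeDetector(alpha=0.5, tau=2.0, A=3)
observations = [0.10, 0.08, 0.10, 0.90, 0.90, 0.90, 0.10]
print([detector.observe(d) for d in observations])
# ['none', 'none', 'none', 'transient', 'transient', 'persistent', 'none']
```

A caller would trigger the model update only on `'persistent'`, which is the behaviour the A-window delay is meant to enforce.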

3.4.2  Model updating schemes. The proposed model update
scheme based on persistent change is called the Selective
Update scheme, denoted as SU. We will compare this
scheme with two other schemes. The first scheme assumes
that an initial model can be used throughout the entire
stream. We denote this scheme as NoU, which stands for
No Update. The second scheme, Always Update (denoted
as AU), updates the model at each time window.

Table 3 summarises the strengths and weaknesses of
these three schemes under different conditions. NoU is
expected to work well when there is no change in the
distribution of normal points; otherwise NoU will fail because
the initial model will become irrelevant after a change due
to P(x|y = normal) has occurred. AU is expected to cope
well when there is a distribution change in the normal points,
but it fails when the change is due to P(x|y = anomaly)
or P(y = anomaly). SU is expected to work well in
most cases, except when there is a transient change due to
P(x|y = normal). In this case, SU will not update its
model and therefore will not perform well during the short
period in which the transient change occurs.
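The three schemes differ only in the decision taken at the end of each window, which can be sketched as follows. The enum `Scheme` and the function `should_update` are hypothetical names, and `consecutive_changes` stands for the number of consecutive windows in which a change in the high-mass profile has been detected.

```python
from enum import Enum


class Scheme(Enum):
    SU = "selective update"
    NOU = "no update"
    AU = "always update"


def should_update(scheme, consecutive_changes, A):
    """Decide at the end of a window whether to copy mass l into mass r."""
    if scheme is Scheme.NOU:
        return False                     # keep the initial model forever
    if scheme is Scheme.AU:
        return True                      # refresh the model every window
    return consecutive_changes >= A      # SU: persistent change only


for m in (0, 1, 3):
    print(m, [should_update(s, m, A=3) for s in Scheme])
# 0 [False, False, True]
# 1 [False, False, True]
# 3 [True, False, True]
```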

Why is a delay of A windows required before a model
update? When there is a combination of changes involving
both P(x|y = normal) and P(x|y = anomaly), a delay
avoids adapting to a transient Type (I) change, so the
detector is still able to detect anomalies over the duration
of this change. Otherwise, the anomaly detector will fail to
detect anomalies during this transient Type (I) change while
adapting to it. Note that this failure can be very serious as
the Type (II) change could accompany a significantly
increased number of anomalies (i.e., Type (III) change). We
demonstrate this combination of changes using one example
in Section 3.4.3 and its effect in Section 5.1.1.


Type of Change   SU                NoU   AU
Type (I)         ✓ (persistent)    ✗     ✓
                 ✗ (transient)
Type (II)        ✓                 ✓     ✗
Type (III)       ✓                 ✓     ✗

Table 3: A summary of the working conditions for the three
model updating schemes. ✓ means a scheme can cope with
a change. ✗ means a scheme cannot cope with a change.


Figure 2: Case 1: A shift in normal cluster centroid. This is
a Type (I) change due to P(x|y = normal).

3.4.3  Changes in high-mass profile when data distribution
changes. The aim of this section is to provide a systematic
analysis of four possible scenarios associated with
distribution change in the context of anomaly detection. To
facilitate our analysis, we will consider a synthetic dataset
with two Gaussian clusters: a big normal cluster whose
centroid is at the origin, and a small anomalous cluster with
about 9% the size of the normal cluster, whose points are
scattered. An example of the data distribution before any
changes occur is shown in Figure 1. This distribution will
undergo a change in the middle of the streaming process.


Figure 1: The original synthetic dataset consists of a big
normal cluster and a small abnormal cluster.


The first case considered here is associated with
distribution change due to normal points only (i.e.,
P(x|y = normal)), as shown in Figure 2. This case is
concerned with a shift of the normal cluster to a completely
new centroid location. An anomaly detection model must
adapt to the new data distribution in order to maintain its
detection accuracy.

The second and third cases are associated with changes
in anomalies only. Figure 3 depicts these two cases. The
second case is associated with a three-fold increase in the
number of anomalies (i.e., P(y = anomaly)) as well as
an increase in density (i.e., P(x|y = anomaly)). The third
case involves an increase in density of anomalies only (i.e.,
P(x|y = anomaly)). These two cases should not trigger
a model update because there is no change in the normal
points.


Case 2: Anomalies increase in number and density
Case 3: Anomalies increase in density

Figure 3: Case 2 has Type (II) and Type (III) changes. Case
3 has Type (II) change only.


The fourth and last case is shown in Figure 4. This case
is more complicated because it involves the following
changes: (i) a shift of the centroid of the normal cluster, and
(ii) an increase in the number and density of the anomalies,
where the anomalies occur as short bursts during the
streaming process. We expect a model update to occur
when there is a distribution change of the normal points.
However, when there are short bursts of anomalies, we do
not want to update the model.


Figure 4: Case 4: A change of data distribution due to
a combination of several changes in normal points and
anomalies. These are Type (I), Type (II) and Type (III) changes.


Figure 5 shows the changes in the high-mass profile
associated with two synthetic cases identified earlier. It
is easy to see that the high-mass profile changes quite
significantly for Case 1 at the middle of the stream, where a
Type (I) change has occurred. When there are Type (II) and
Type (III) changes which involve anomalies only, as in Case
2, the change in the high-mass profile is relatively small.
These examples show that analysing the changes in the
high-mass profile is an effective way to detect distribution
change due to P(x|y = normal), or Type (I) change.


Figure 5: An example of high-mass profile changes in two
different synthetic cases. (Axis label: Progression of data
stream.)


4  Experimental Setup

4.1  Data. We use the four synthetic datasets presented in
Section 3.4.3. Each of these synthetic datasets has a very
specific property and allows us to analyse the conditions
under which a method could fail.

We also use six large datasets from different domains.
The first two datasets are SMTP and HTTP from the KDD
CUP 99 network intrusion data as used in [13]. These
datasets have a temporal aspect in their data sequence,
which resembles the properties of streaming data. HTTP
is characterised by its sudden bursts of anomalies in some
streaming segments. SMTP does not produce bursts of
anomalies, but possibly exhibits some distribution changes
within the streaming sequence.

In practice, it is hard to quantify whether a distribution
change has indeed occurred within a stream. For this
reason, we have derived a dataset, SMTP+HTTP, which
contains the SMTP data instances followed by the HTTP data
instances. When viewed as an entire stream, we expect
a distribution change to occur when the communication
protocol is switched from SMTP to HTTP.

We also use the COVERTYPE and SHUTTLE datasets
from the UCI Machine Learning Repository [2].
COVERTYPE is a relatively large dataset and is commonly used in
data stream research. We split the anomaly class into several
small groups and placed them in different segments of the
dataset. As will be shown later, this simulates short bursts
of anomalies in different streaming segments. As for the
SHUTTLE dataset, it represents a situation where there is
little or no distribution change.

The last dataset we use is MULCROSS [11]. This
dataset contains dense clusters of anomalies that are harder to
detect than scattered anomalies. We expect many traditional
anomaly detection methods to fail in detecting the dense
anomalies in this dataset.

Table 4 gives a summary of the real and synthetic data
used in this study.


Dataset               N        D    anomaly class
Synthetic Case 1,3    4400     2    class 2 (9%)
Synthetic Case 2,4    4800     2    class 2 (16.7%)
SMTP (KDD Cup 99)     95156    3    attack (0.03%)
HTTP (KDD Cup 99)     567497   3    attack (0.4%)
SMTP+HTTP             662653   3    attack (0.35%)
COVERTYPE             286048   10   class 4 (0.9%) vs. class 2
SHUTTLE               49097    9    class 2,3,5-7 (7%)
MULCROSS              262144   4    2 clusters (10%)

Table 4: A summary of the characteristics of datasets used.
N is the number of instances and D is the number of
dimensions. The percentage in bracket indicates the percentage
of anomalies.


4.2 Settings. For all the experiments, we conducted ten
independent runs of each algorithm on each dataset. Each run
was conducted as a single-threaded job processed at 2.3GHz
in a Linux cluster (www.vpac.org). Once the anomaly scores
for all instances are obtained, the results can be evaluated
by using the anomaly scores to rank the instances. Normal
points are expected to have high scores, whereas anomalies
are expected to have low scores. From this ranking and the
ground truth, we then compute the AUC (Area Under the
receiver operating characteristic Curve) [4] to measure the
performance of all anomaly detectors reported in this paper. The
actual CPU times are also reported for the six large datasets.
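The evaluation step above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the experiments: the function name and the toy scores are made up, and the AUC is computed directly as the probability that a randomly chosen anomaly receives a lower score than a randomly chosen normal point, which matches the convention that anomalies are expected to have low scores.

```python
def auc_from_scores(scores, is_anomaly):
    """AUC for a detector whose anomalies are expected to score LOW.

    Equals the probability that a randomly chosen anomaly receives a lower
    score than a randomly chosen normal point (ties count one half).
    """
    anomalies = [s for s, a in zip(scores, is_anomaly) if a]
    normals = [s for s, a in zip(scores, is_anomaly) if not a]
    if not anomalies or not normals:
        raise ValueError("need at least one anomaly and one normal point")
    wins = 0.0
    for a in anomalies:
        for n in normals:
            if a < n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomalies) * len(normals))


# Toy data (made up for illustration): low scores flag anomalies.
scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.85]
labels = [False, False, True, False, True, False]
print(auc_from_scores(scores, labels))
```

The pairwise loop is quadratic and is only meant to make the definition explicit; a rank-based computation gives the same value for large datasets.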

The parameter settings for Streaming HS-Trees are as
follows. We use an ensemble size (t) of 25 trees. For the
exponential update of changes in mass profile, we set the
smoothing constant α at 0.3. The window size (ψ) is 250.
The threshold (τ) for detecting a change in the high-mass
profile is 4. The number of consecutive changes (A) for
persistent change is set at 1 for all the small synthetic cases.
For the six large datasets, A is fixed at 4. All these settings
remain unchanged throughout the experiments.
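To make the roles of the smoothing constant and the threshold concrete, here is a small sketch under stated assumptions. The exponential smoothing step is the standard one; the flagging rule (a window is flagged when its change value exceeds the threshold times the smoothed level of earlier quiet windows) is hypothetical and stands in for the test statistic defined with the change detector, which is not reproduced here. The function name and the input series are illustrative only.

```python
ALPHA = 0.3   # smoothing constant from the settings above
TAU = 4.0     # change threshold from the settings above


def flag_changes(changes, alpha=ALPHA, tau=TAU):
    """Flag windows whose change value exceeds tau times the smoothed level.

    HYPOTHETICAL rule for illustration; the paper's actual test statistic is
    defined with the change detector, not here.
    """
    flags = []
    level = None
    for x in changes:
        if level is None:          # first window seeds the smoothed level
            level = x
            flags.append(False)
            continue
        changed = x > tau * level
        flags.append(changed)
        if not changed:            # only quiet windows update the level
            level = alpha * x + (1 - alpha) * level
    return flags


# Made-up per-window change values: one isolated spike, then a sustained rise.
per_window_change = [1.0, 1.1, 0.9, 1.0, 9.0, 1.0, 8.0, 8.5]
print(flag_changes(per_window_change))
```

With α = 0.3 the smoothed level reacts slowly, so a single large value stands out against it; how many flagged windows in a row count as a persistent change is governed separately by A.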

5  Experimental  Results 

In this section, we report the results of two experiments.
The aim of the first experiment is to examine the effectiveness
of the change detector in Streaming HS-Trees for performing
model updates in order to cope with evolving data
streams. To demonstrate the practical benefits of the proposed
method, we compare Streaming HS-Trees with the
three model updating schemes described in Section 3.4.2,
namely Selective Update (SU), No Update (NoU) and Always
Update (AU). Results of this experiment will be reported
in Section 5.1.

The aim of the second experiment is to examine
how Streaming HS-Trees fares in comparison with existing
anomaly detection methods. The benchmarking methods include
ORCA, a distance-based anomaly detection method based on
k-Nearest Neighbours (k-NN) [3], and One-Class Support Vector
Machine (SVM) [12]. ORCA is selected because it is a
major improvement over k-distance anomaly detection
implementations in terms of efficiency. Using a pruning rule with
randomly ordered samples, the time complexity of ORCA is
reduced to near linear. The parameters of ORCA are set to
k = 5 and N = n/8, following the same treatment as in [10]. As
for One-Class SVM, we apply the Radial Basis Function kernel
and the parameter setting suggested in [5]. Results for this
experiment will be reported in Section 5.2.

5.1 Results of Streaming HS-Trees with three model
updating schemes. The presentation of results is organised
as follows. Section 5.1.1 provides an analysis of the results
on the four synthetic cases. Section 5.1.2 gives a detailed
analysis of the results on large datasets. Section 5.1.3 gives
an analysis of the speed of processing the largest dataset (i.e.,
SMTP+HTTP) used in this study.

5.1.1 Results on synthetic cases. Table 5 shows that
Streaming HS-Trees using the SU scheme is the most robust
in terms of the overall AUC scores. Its results are either
ranked first or second for the synthetic data. Notice that
the performance of Streaming HS-Trees using the SU scheme
is slightly lower than that of using the AU scheme in Synthetic
Case 1. This is because the SU scheme updates its model at the
next time window after a change is detected, and this delay
degrades its performance slightly. As will be demonstrated in
Section 5.1.2, this delay in model update is quite small for large
streaming data and therefore the effect on the detection
accuracy is not substantial.

Table 5 shows that NoU only works well when there are
no changes in the high-mass distribution (i.e., Cases 2 and
3). It fails badly when a major distribution change occurs in
Cases 1 and 4. This is because the initial model used in NoU
is no longer relevant after a distribution change in normal
points has occurred in the middle of the stream.

AU works well when there is a distribution change in
normal points only (i.e., Synthetic Case 1). However, its
performance degrades when there is a sudden increase in
the number of anomalies in Cases 2 and 3. This is because
the model updates itself and treats some of the anomalies
as normal points. Furthermore, AU's performance becomes
poorer when the anomalies occur in short bursts during the
streaming process, as in Case 4.


Dataset             SU      NoU     AU
Synthetic Case 1    0.935   0.519   0.989
Synthetic Case 2    0.996   0.996   0.940
Synthetic Case 3    0.997   0.992   0.990
Synthetic Case 4    0.982   0.531   0.789
HTTP                0.998   0.984   0.143
SMTP                0.858   0.753   0.874
SMTP+HTTP           0.994   0.403   0.262
COVERTYPE           0.915   0.855   0.743
SHUTTLE             0.997   0.997   0.997
MULCROSS            0.974   0.980   0.965

Table 5: A summary of the overall AUC scores for the
synthetic and real data for Streaming HS-Trees using the
three model updating schemes: NoU, AU and SU. The best
result is boldfaced and underlined; the second best result is
boldfaced.


5.1.2  Results  on  large  streaming  data.  Table  5  shows 
that  the  performance  of  SU  is  the  most  consistent  throughout 
all  the  datasets.  We  will  discuss  these  results  in  conjunction 
with  the  AUC  performance  in  different  streaming  segments 
in  each  dataset. 

HTTP: AU performs a lot worse than both SU and NoU
in this dataset. From our analysis summarised in Table 3, it
is likely that there are Type (II) and/or Type (III) changes.
Figure 6 reveals that the bursts of anomalies occur at segment
3 and the change continues to segment 4 to a lesser extent.

NoU and SU perform well in this dataset. NoU
never updates its model and therefore detects the bursts of
anomalies successfully. SU regards the bursts of anomalies
as transient changes and never updates its model to these
changes. Hence it also performs well on this dataset with
the highest AUC of 0.998 as reported in Table 5.


[Figure 6 plot; legend: SU, NoU, AU; x-axis: Progression of data stream (120000)]

Figure 6: HTTP - AUC scores over five segments. The
number on top of each segment is the total number of
anomalies in that segment.

SMTP: AU and SU have similar performance in this
dataset, but NoU performs poorly. This scenario is likely
to be due to Type (I) change. Indeed, Figure 7 shows that
NoU performs worse than both AU and SU in all five
segments. Segment 3 could be the result of a combination of
transient Type (I) and Type (II) (or Type (III)) changes;
this causes both AU and SU to perform poorly as
well.
[Figure 7 plot; legend: SU, NoU, AU; x-axis: Progression of data stream (20000)]

Figure 7: SMTP - AUC scores over five segments.

SMTP+HTTP: Because of a persistent change of
P(x|y) at the changeover from SMTP to HTTP in segment 1,
NoU performs poorly throughout the HTTP stream from
segment 3 to segment 5 in Figure 8. This is because NoU does
not update its model that was previously learned from the
SMTP stream in segment 1. The behaviour of AU and SU is
consistent with their performance previously shown in
Figures 6 and 7. Finally, SU has the best result in this dataset:
its AUC remains high throughout the entire stream. This is
because SU successfully updates its model when there is a
change in the protocol, while avoiding a model update
when there is a burst of anomalies.
[Figure 8 plot; legend: SU, NoU, AU; x-axis: Progression of data stream (150000)]

Figure 8: SMTP+HTTP - AUC scores over five segments.


COVERTYPE: This dataset demonstrates a scenario
reminiscent of Synthetic Case 4, in which there is a combination
of persistent changes of Type (I), Type (III) and/or Type
(II) and in which only SU performs well out of the three schemes.
Both AU and NoU perform notably worse in segments 3 and
4 when there are bursts of anomalies, as shown in Figure 9.
SU suffers a little at the beginning of segment 1; this is likely
to be due to one or two transient changes of Type (I).


Figure  9:  COVERTYPE  -  AUC  scores  over  five  segments. 

MULCROSS and SHUTTLE: All schemes do well in
these two datasets. It is likely that there is little or no change
in data distribution in them, as demonstrated in all segments
of Figures 10 (a) and (b).

Our results show that SU can strike a balance between
no adaptation to changes in data distribution, as in NoU,
and over-adaptation, as in AU. As a result, Streaming HS-Trees
adopting the SU scheme is more robust in evolving
data streams. Table 6 shows that SU only performs a small
number of model updates compared to the high number of
updates performed by AU. For example, SU performs only 3
updates on the SMTP+HTTP dataset, compared to no update
in NoU and 2650 updates in AU; yet its AUC performance is
significantly higher than that of NoU and AU.
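The contrast between the three schemes can be illustrated with a small sketch that counts model updates given one change flag per window. It assumes, as an illustration only, that SU updates once a run of consecutive flagged windows reaches the length A and treats shorter runs as transient; the function name, the flag sequence and the restart of the run after an update are hypothetical choices, not the paper's implementation.

```python
def count_updates(change_flags, scheme, run_length=4):
    """Count model updates over a stream of per-window change flags.

    scheme: "NoU" (never update), "AU" (update after every window) or
    "SU" (update only after `run_length` consecutive flagged windows).
    """
    if scheme == "NoU":
        return 0
    if scheme == "AU":
        return len(change_flags)
    if scheme != "SU":
        raise ValueError(f"unknown scheme: {scheme}")
    updates = 0
    run = 0
    for changed in change_flags:
        run = run + 1 if changed else 0
        if run == run_length:      # persistent change: update, then start over
            updates += 1
            run = 0
    return updates


# Made-up stream: a 2-window burst (transient), then a 4-window shift (persistent).
flags = [False] * 10 + [True] * 2 + [False] * 10 + [True] * 4 + [False] * 10
for scheme in ("NoU", "AU", "SU"):
    print(scheme, count_updates(flags, scheme))
```

In this toy stream the short burst triggers no SU update while the sustained shift triggers exactly one, which mirrors the behaviour described for bursts of anomalies versus a protocol changeover.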
[Figure 10 plots; legend: SU, NoU, AU; panel (a) MULCROSS, x-axis: Progression of data stream (60000); panel (b) SHUTTLE, x-axis: Progression of data stream (10000)]

Figure 10: MULCROSS and SHUTTLE - AUC scores over
five segments.


               No. of Updates
Dataset        SU      AU
HTTP           1       2269
SMTP           2       380
SMTP+HTTP      3       2650
COVERTYPE      1       1144
MULCROSS       1       1048
SHUTTLE        1       196

Table 6: Number of updates executed by SU and AU.
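As a quick arithmetic check, the AU counts in Table 6 coincide with the number of full windows of 250 instances in each dataset, using N from Table 4. This is an observation from the reported numbers rather than a statement made in the text; the variable names below are illustrative.

```python
WINDOW_SIZE = 250  # window size from Section 4.2

# N from Table 4 and AU update counts from Table 6.
datasets = {
    "HTTP": (567497, 2269),
    "SMTP": (95156, 380),
    "SMTP+HTTP": (662653, 2650),
    "COVERTYPE": (286048, 1144),
    "MULCROSS": (262144, 1048),
    "SHUTTLE": (49097, 196),
}

for name, (n_instances, au_updates) in datasets.items():
    full_windows = n_instances // WINDOW_SIZE
    print(f"{name:10s} full windows = {full_windows:5d}  AU updates = {au_updates}")
```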


5.1.3 Analysis of stream processing speed. Figure 11
shows that, when processing the SMTP+HTTP dataset,
Streaming HS-Trees can process at least 20,000 items per
second. This is a reasonable speed for processing real
streaming data or large static datasets. NoU is the most
efficient scheme since it does not perform any model update.
AU always updates its model and thus it is not as fast as NoU.
The speed of SU is in between the speeds of AU and NoU.

[Figure 11 plot; x-axis: Progression of data stream]

Figure 11: The number of instances processed per second
over each segment in the data stream (SMTP+HTTP
dataset).

5.2 Comparison with ORCA and One-Class SVM. In
this section, we compare Streaming HS-Trees (using the SU
scheme) to ORCA and SVM. Table 7 shows that Streaming
HS-Trees significantly outperforms ORCA and SVM, both
in terms of AUC and runtime.

We see that Streaming HS-Trees attains an AUC score
of almost one (rounded to two decimal places) on
the SHUTTLE, MULCROSS and HTTP datasets, whereas
ORCA and SVM perform significantly worse. In particular,
ORCA does not perform well with MULCROSS and HTTP
because the density of the anomaly class is higher than
that of normal instances. This reverses the ranking, so that
normal instances are ranked ahead of anomalies, giving a
worse-than-random AUC score.
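The reversal described above can be reproduced on toy data. The sketch below scores points by the distance to their k-th nearest neighbour, a generic distance-based score used here as a stand-in (it is not ORCA itself), on made-up two-dimensional data in which the anomalies form a cluster much denser than the normal points. All names and data are illustrative.

```python
import math
import random


def knn_distance_scores(points, k=5):
    """Distance to the k-th nearest neighbour; larger means more anomalous."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.hypot(p[0] - q[0], p[1] - q[1])
            for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])
    return scores


def auc_high_is_anomalous(scores, is_anomaly):
    """Probability that a random anomaly outscores a random normal point."""
    anomalies = [s for s, a in zip(scores, is_anomaly) if a]
    normals = [s for s, a in zip(scores, is_anomaly) if not a]
    wins = sum((a > n) + 0.5 * (a == n) for a in anomalies for n in normals)
    return wins / (len(anomalies) * len(normals))


rng = random.Random(0)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
dense_anomalies = [(rng.gauss(4, 0.01), rng.gauss(4, 0.01)) for _ in range(30)]
points = normal + dense_anomalies
labels = [False] * 200 + [True] * 30

auc = auc_high_is_anomalous(knn_distance_scores(points, k=5), labels)
print(f"AUC of k-NN distance on a dense anomaly cluster: {auc:.2f}")
```

Because each anomaly has many very close neighbours inside its own cluster, its k-NN distance is smaller than that of typical normal points, so the distance-based ranking places normal points first.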

Notice that One-Class SVM performs reasonably well
on SMTP and HTTP, with AUC scores of 0.78 and 0.90
respectively. However, when these two datasets are combined
as SMTP+HTTP, One-Class SVM fails quite badly. This is
because the mixed distributions in SMTP+HTTP are too
complex for One-Class SVM to handle, resulting in a poor AUC
score of 0.43.

Recall that Streaming HS-Trees is a one-pass algorithm,
restricting itself to inspecting each data point only once,
whereas ORCA and SVM require all the data to be loaded into
working memory. Table 7 shows that the effort of building a
one-pass Streaming HS-Trees translates well into a very fast
processing speed. In contrast, One-Class SVM is the slowest
because of the need to perform optimization during its model
building process. ORCA is faster than SVM, but it is still (on
average) 267 times slower than Streaming HS-Trees.

Another observation is that ORCA takes a longer time
(13197 seconds) to process SMTP+HTTP, compared to the
total time (267 + 9487 seconds) to process the SMTP and HTTP
datasets individually. This suggests that the processing time
of ORCA becomes longer when the (combined) distribution
becomes more complex. On the other hand, Streaming HS-Trees
takes a relatively shorter time (35 seconds) to process
SMTP+HTTP, compared to the total time (9 + 32 seconds)
to process the SMTP and HTTP datasets individually. This
suggests that the processing time of Streaming HS-Trees is
not sensitive to more complex data distributions.
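The comparison above can be checked directly from the runtimes in Table 7. The short script below is only a worked calculation on those reported values; the dictionary layout is an illustrative choice.

```python
# Runtimes in seconds, copied from Table 7.
runtime = {
    "Streaming HS-Trees": {"SMTP": 9.07, "HTTP": 32.07, "SMTP+HTTP": 35.10},
    "ORCA": {"SMTP": 267.45, "HTTP": 9487.47, "SMTP+HTTP": 13197.04},
}

for method, t in runtime.items():
    separate = t["SMTP"] + t["HTTP"]
    ratio = t["SMTP+HTTP"] / separate
    print(f"{method}: separate = {separate:.2f}s, "
          f"combined = {t['SMTP+HTTP']:.2f}s, combined/separate = {ratio:.2f}")
```

A ratio above one means the combined stream costs more than the two streams processed separately, which is the case for ORCA but not for Streaming HS-Trees.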


              AUC                            Runtime (seconds)
              Streaming                      Streaming
Dataset       HS-Trees   ORCA    SVM         HS-Trees   ORCA        SVM
SHUTTLE       1.00       0.60    0.79        7.09       155.66      332.09
MULCROSS      0.97       0.33    0.59        18.14      2512.20     7342.54
HTTP          1.00       0.36    0.90        32.07      9487.47     35872.09
SMTP          0.86       0.80    0.78        9.07       267.45      986.84
SMTP+HTTP     0.99       0.38    0.43        35.10      13197.04    34918.70
COVERTYPE     0.92       0.83    0.90        21.13      6995.17     9737.81

Table 7: Streaming HS-Trees (using the SU scheme) performs favourably compared to ORCA and SVM, both in terms of AUC and total
processing time. Boldfaced entries are the best results.


6  Discussion 

Many methods in data stream research employ the sliding
window approach, for example, the Concept-adapting Very Fast
Decision Trees algorithm (CVFDT) [8]. Like the Very Fast
Decision Trees algorithm (VFDT) [6], CVFDT uses the Hoeffding
bound as a change detection mechanism to determine
whether a model update is warranted. Here we highlight two
key differences in comparison with Streaming Half-Space
Trees. First, the change detection mechanism in CVFDT
does not take into account the types of change and the two
categories of change (i.e., transient change and persistent
change) we have discussed in this paper. While one may
argue that this is not required in classification tasks, we
suspect that this consideration will bring more insight
into data stream issues in classification tasks, especially in
skewed class distribution problems.
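For readers unfamiliar with the Hoeffding bound mentioned above, the following sketch computes it in the form commonly used by Hoeffding-tree learners. It is included only as background on the related work; the function and parameter names are illustrative, and the values of the range and confidence parameter are made up.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Return epsilon such that, with probability at least 1 - delta, the true
    mean of a variable with range `value_range` is at least the mean of n
    independent observations minus epsilon."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


# The bound shrinks as more instances are observed.
for n in (100, 1000, 10000):
    print(n, round(hoeffding_bound(value_range=1.0, delta=1e-7, n=n), 4))
```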

Second, most sliding-window-based methods are known
to be sensitive to the window size: if the window is too large,
the model will perform poorly when there is a change in
data distribution; if the window is too small, then the model
will be inaccurate because of the small training data size. This
applies to CVFDT. As shown by our results, Streaming HS-Trees
only needs a small amount of training data in order for
an ensemble to perform well. The results for MULCROSS
and SHUTTLE in Table 7 show that Half-Space Trees
performs significantly better than ORCA and SVM, even
though ORCA and SVM employ all available data (50,000
and 260,000 instances) for training whereas each Half-Space Tree
is trained from 250 instances only. Note that these two
datasets have no change in data distribution, which is the
ideal condition for both ORCA and SVM.

There are some change detection methods proposed in
the literature, e.g., the dynamic weighted majority algorithm
(DWM) [9], and other variants [7]. DWM maintains an
ensemble of base learners in classification tasks and predicts
using a weighted majority vote; it dynamically creates
and deletes base learners in response to changes in prediction
performance. Thus, the change detection mechanism is
solely based on the prediction accuracy used to weight and discard
each learner. Gao et al. [7] use a dataset with balanced
distribution to train each model in the ensemble to deal with
skewed class distributions. Although these methods are generic
change detection methods, it is unclear whether they can deal with
the window size issue; and because they do not assess the data
distribution change directly, they are unable to deal with
different types of change.

A framework for on-demand classification of evolving
data streams [1] enables simultaneous training and testing
streams to be used for classification. In contrast, our Streaming
Half-Space Trees uses the same stream for training and
prediction.

The key limitation of Streaming HS-Trees is its space
complexity, which is exponential in the tree height h. The
flip side of this limitation has many advantages, i.e., the
model structure can be built without training data, and it only
needs to be built once; the model structure does not change
and the memory space stays constant throughout the entire
data stream. An alternative is to build a model when the
training data is available, and retrain another one if it needs
to be updated. While this may reduce the overall memory
requirement, it takes up precious time for training a new
model in every model update, thus reducing the number of
instances that can be processed; this can be critical in many
real-world applications.
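The growth in memory with tree height can be made concrete with a short calculation. It assumes each tree is a perfect binary tree whose leaves sit at depth h, which is an assumption made here for illustration; the ensemble size of 25 is taken from Section 4.2, while the heights tried below are hypothetical.

```python
def hs_tree_nodes(height):
    """Nodes in a perfect binary tree whose leaves sit at depth `height`."""
    return 2 ** (height + 1) - 1


def ensemble_nodes(n_trees, height):
    """Total node count for an ensemble of identical-height trees."""
    return n_trees * hs_tree_nodes(height)


# t = 25 trees as in Section 4.2; the heights below are hypothetical.
for h in (5, 10, 15):
    print(f"h = {h:2d}: {hs_tree_nodes(h):6d} nodes per tree, "
          f"{ensemble_nodes(25, h):7d} in the ensemble")
```

Each additional level roughly doubles the node count, but for a fixed h the total is a constant that does not depend on the length of the stream.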

We have shown that the Selective Update scheme takes
the best from the two other schemes: No Update and Always
Update. However, it is important to be aware of the constraint
imposed by using A consecutive windows to define a
persistent change: (i) The anomaly detector will perform poorly
within the A windows under a Type (I) persistent change. But
this is limited to A windows only, an error guarantee in the
SU scheme that cannot be found in the other two schemes.
(ii) If there are many transient Type (I) changes, then SU will
perform poorly. In practice, a data distribution change often
involves a combination of different types of change, and
we have demonstrated that SU is more robust than AU and
NoU in such real-world scenarios.


7  Concluding  Remarks 

The  proposed  anomaly  detection  algorithm,  Streaming  HS- 
Trees,  satisfies  the  key  constraints  for  mining  evolving  data 
streams: 

•  It is a one-pass algorithm with amortised O(1) time
complexity and O(1) space complexity; this allows it
to deal with huge datasets or infinite data streams.

•  It incorporates three mechanisms: anomaly detection,
change detection, and model update, in a single framework.

The  use  of  Half-Space  Trees  in  the  framework  brings 
about  the  following  features: 

1)  An  HS-Tree  structure  can  be  built  without  any  data. 

2)  Data  profile  can  be  updated  incrementally  in  the  HS-Tree 
structure  as  the  data  stream  progresses. 

3) The HS-Tree structure only needs to be constructed once
and is used throughout its entire life span, even when model
updates are required during the streaming process.

4)  Model  updates  are  simple  and  efficient. 

Most  existing  algorithms  have  only  one  or  two  of  the 
above-mentioned  features;  incorporating  all  of  the  above 
features  within  a  single  method  is  a  rarity. 
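The four features listed above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a known bounding range for the data (the unit square in the usage lines), splits a randomly chosen dimension at its midpoint, and treats a model update as a simple reset of the mass counts while keeping the same structure. All class and function names are hypothetical.

```python
import random


class Node:
    def __init__(self, depth, dim=None, split=None, left=None, right=None):
        self.depth, self.dim, self.split = depth, dim, split
        self.left, self.right = left, right
        self.mass = 0  # number of instances that have passed through this node


def build_tree(lo, hi, depth, max_depth, rng):
    """Build the structure from the assumed data range alone (no instances)."""
    if depth == max_depth:
        return Node(depth)
    dim = rng.randrange(len(lo))
    split = (lo[dim] + hi[dim]) / 2.0            # halve the chosen dimension
    left_hi = hi[:dim] + [split] + hi[dim + 1:]
    right_lo = lo[:dim] + [split] + lo[dim + 1:]
    return Node(depth, dim, split,
                build_tree(lo, left_hi, depth + 1, max_depth, rng),
                build_tree(right_lo, hi, depth + 1, max_depth, rng))


def update_mass(node, x):
    """Record one instance by incrementing mass along its root-to-leaf path."""
    while node is not None:
        node.mass += 1
        if node.dim is None:                     # reached a leaf
            break
        node = node.left if x[node.dim] < node.split else node.right


def reset_mass(node):
    """Simplified model update: clear the profile, keep the tree structure."""
    if node is not None:
        node.mass = 0
        reset_mass(node.left)
        reset_mass(node.right)


rng = random.Random(0)
tree = build_tree([0.0, 0.0], [1.0, 1.0], 0, 3, rng)   # built before any data
for point in [(0.1, 0.2), (0.15, 0.25), (0.9, 0.9)]:
    update_mass(tree, point)                            # incremental profile
print(tree.mass, tree.left.mass, tree.right.mass)
```

Because the structure depends only on the assumed range and the random seed, recording an instance touches one root-to-leaf path and a model update never rebuilds the tree.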

We  have  identified  the  specific  type  of  change  in  data 
distribution  in  evolving  data  streams  that  requires  model 
update  in  order  to  maintain  high  detection  accuracy.  We 
have  also  identified  other  types  of  change  that  should  not 
trigger  a  model  update;  otherwise  the  detection  performance 
will  degrade.  This  has  led  us  to  devise  an  effective  change 
detection  and  model  update  mechanism  in  the  framework. 

We have shown in our empirical evaluation that Streaming
HS-Trees with the selective update scheme is more robust
in various scenarios in evolving data streams than two
other schemes: always update and no update. We have also
shown that Streaming HS-Trees significantly outperforms
two state-of-the-art anomaly detection algorithms in terms
of both detection accuracy and runtime.
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